To help a user specify and verify quantified queries, a
class of database queries known to be very challenging for
all but the most expert users, one can question the user on
whether certain data objects are answers or non-answers to
her intended query. In this paper, we analyze the number of
questions needed to learn or verify qhorn queries, a special
class of Boolean quantified queries whose underlying form is
conjunctions of quantified Horn expressions. We provide optimal
polynomial-question and polynomial-time learning and
verification algorithms for two subclasses of the class qhorn
with upper constant limits on a query's causal density.

Categories and Subject Descriptors 

H.2.3 [Database Management]: Languages - query languages;
I.2.2 [Artificial Intelligence]: Automatic Programming -
program synthesis, program verification; I.2.6
[Artificial Intelligence]: Learning - concept learning

Keywords 

quantified boolean queries, qhorn, query learning, query verification,
example-driven synthesis

1. INTRODUCTION 

It's a lovely morning, and you want to buy a box of choco- 
lates for your research group. You walk into a chocolate store 
and ask for "a box with dark chocolates — some sugar-free 
with nuts or filling". However, your server is a pedantic lo- 
gician who expects first-order logic statements. In response 
to your informal query he places in front of you a hundred 
boxes! Despite your frustration, you are intrigued: you open 
the first box only to find one dark, sugar-free chocolate with
nuts and many other varieties of white chocolates that you 
didn't order. You push it aside, indicating your disapproval, 
and proceed to the second. Inside, you are wondering: Is 
there hope that I can communicate to this person my needs 
through a sequence of such interactions? 



Every day, we request things from each other using informal
and incomplete query specifications. Our casual interactions
facilitate such under-specified requests because we
have developed questioning skills that help us clarify such 
requests. A typical interlocutor might ask you about cor- 
ner cases, such as the presence of white chocolates in the 
box, to get to a precise query specification by example. As 
requesters, we prefer to begin with an outline of our query 
— the key properties of the chocolates — and then make 
our query precise using a few examples. As responders, we 
can build a precise query from the query outline and a few 
positive or negative examples — acceptable or unacceptable 
chocolate boxes. 

Typical database query interfaces behave like our logi- 
cian. SQL interfaces, for example, force us to formulate 
precise quantified queries from the get-go. Users find quantified
query specification extremely challenging [1][14]. Such
queries evaluate propositions over sets of tuples rather than 
individual tuples, to determine whether a set as a whole sat- 
isfies the query. Inherent in these queries are (i) the grouping 
of tuples into sets, and (ii) the binding of query expressions 
with either existential or universal quantifiers. Existential 
quantifiers ensure that some tuple in the set satisfies the ex- 
pression, while universal quantifiers ensure that all tuples in 
the set satisfy the expression. 

To simplify the specification of quantified queries, we built
DataPlay [1]. DataPlay tries to mimic casual human interactions:
users first specify the simple propositions of a query.
DataPlay then generates a simple quantified query that contains
all the propositions. Since this query may be incorrect,
users can label query results as answers or non-answers to
their intended query. DataPlay uses this feedback on example
tuple-sets to fix the incorrect query. Our evaluation
of DataPlay shows that users prefer example-driven query
specification techniques for specifying complex quantified
queries [1]. Motivated by these findings, we set out to answer
the question: How far can we push the example-driven query
specification paradigm? This paper studies the theoretical
limits of using examples to learn and to verify a special subclass
of quantified queries, which we call qhorn, in the hope
of eventually making query interfaces more human-like.
1.1 Our contributions

We formalize a query learning model where users specify
propositions that form the building blocks of a Boolean
quantified query. A learning algorithm then asks the users
membership questions: each question is an example data
object, which the user classifies as either an answer or a
non-answer. After a few questions, the learning algorithm
terminates with the unique query that satisfies the user's
responses to the membership questions. The key challenge
we address in this paper is how to design a learning algorithm
that runs in polynomial time, asks as few questions
as possible and exactly identifies the intended query.
We prove the following: 

1. Learning quantified Boolean queries is intractable: a
doubly exponential number of questions is required (§2).
Within a special class of quantified Boolean queries
known as qhorn (§2.1), we prove two subclasses are
exactly and efficiently learnable: qhorn-1 (§2.1.3) and
its superset role-preserving qhorn (§2.1.4) with constant
limits on causal density (Def. 2.6).

2. We design an optimal algorithm to learn qhorn-1 queries
using $O(n \lg n)$ questions where $n$ is the number of propositions
in a query (§3.1).

3. We design an efficient algorithm to learn role-preserving
qhorn queries using $O(kn \lg n + n^{\theta+1})$ questions where $k$
is query size (Def. 2.5) and $\theta$ is causal density (§3.2).



We also formalize a query verification model where the user 
specifies an entire query within the role-preserving qhorn 
query class. A verification algorithm then asks the user a 
set of membership questions known as the verification set. 
Each query has a unique verification set. The verification 
algorithm classifies some questions in the set as answers and 
others as non-answers. The query is incorrect if the user 
disagrees with any of the query's classification of questions 
in the verification set. 

We design a verification algorithm that asks $O(k)$ membership
questions (§4).



2. PRELIMINARIES 

Before we describe our query learning and verification al- 
gorithms, we first describe our data model — nested rela- 
tions — and the qhorn query class. 

Definition 2.1. Given the sets $D_1, D_2, \ldots, D_m$, $\mathcal{R}$ is
a relation on these $m$ sets if it is a set of $m$-tuples
$(d_1, d_2, \ldots, d_m)$ such that $d_i \in D_i$ for $i = 1, \ldots, m$. $D_1, \ldots, D_m$
are the domains of $\mathcal{R}$.

Definition 2.2. A nested relation $\mathcal{R}$ has at least one
domain $D_i$ that is a set of subsets (powerset) of another
relation $\mathcal{R}_i$. This $\mathcal{R}_i$ is said to be an embedded relation of
$\mathcal{R}$.

Definition 2.3. A relation $\mathcal{R}$ is a flat relation if all its
domains $D_1, \ldots, D_m$ are not powersets of another relation.

For example, a flat relation of chocolates can have the 
following schema: 

Chocolate (isDark, hasFilling, isSugarFree, 
hasNuts, origin) 

A nested relation of boxes of chocolates can have the fol- 
lowing schema: 

Box (name, Chocolate (isDark, hasFilling,
isSugarFree, hasNuts, origin))



In this paper, we analyze queries over a nested relation with
single-level nesting, i.e. the embedded relation is flat. The
Box relation satisfies single-level nesting as the Chocolate
relation embedded in it is flat. To avoid confusion, we refer
to elements of the nested relation as objects and elements 
of the embedded flat relation as tuples. So the boxes are 
objects and the individual chocolates are tuples. 

Definition 2.4. A Boolean query maps objects into ei- 
ther answers or non-answers. 

The atoms of a query are Boolean propositions such as: 

$p_1$ : c.isDark, $p_2$ : c.hasFilling,
$p_3$ : c.origin = Madagascar

A complete query statement assigns quantifiers to expres- 
sions on propositions over attributes of the embedded rela- 
tion. For example: 



$\forall c \in$ Box.Chocolates $(p_1)\ \wedge$
$\exists c \in$ Box.Chocolates $(p_2 \wedge p_3)$    (1)



A box of chocolates is an answer to this query if every 
chocolate in the box is dark and there is at least one choco- 
late in the box that has filling and comes from Madagascar. 

Given a collection of propositions, we can construct an
abstract Boolean representation for the tuples of the nested
relation. For example, given propositions $p_1, p_2, p_3$, we
can transform the chocolates from the data domain to the
Boolean domain as seen in Figure 1.

Figure 1: Transforming data from its domain into a
Boolean domain.

Thus, each proposition $p_i$ is replaced with a Boolean variable
$x_i$. We rewrite the Boolean query (1) as follows:

$\forall t \in S\ (x_1) \wedge \exists t \in S\ (x_2 \wedge x_3)$

where S is the set of Boolean tuples for an object. This 
Boolean representation allows us to create learning and ver- 
ification algorithms independent of the data domain or of 
the actual propositions that the user writes. 

To support this Boolean representation of tuples, how- 
ever, we assume that (i) it is relatively efficient to con- 
struct an actual data tuple from a Boolean tuple and that 
(ii) the true/false assignment to one proposition does not
interfere with the true/false assignments to other propositions.
The propositions $p_m$ : c.origin = Madagascar and
$p_b$ : c.origin = Belgium interfere with each other as a
chocolate cannot be both from Madagascar and Belgium:
$p_m \rightarrow \neg p_b$ and $p_b \rightarrow \neg p_m$.

With three propositions, we can construct $2^3$ possible
Boolean tuples, corresponding to the $2^3$ possible true or false
assignments to the individual propositions, i.e. we can construct
8 different chocolate classes. With $n$ propositions, we
can construct $2^n$ Boolean tuples.

There are $2^{2^n}$ possible sets of Boolean tuples or unique
objects. With our three chocolate propositions, we can construct
256 boxes of distinct mixes of the 8 chocolate classes.
Since a Boolean query maps each possible object into an
answer or a non-answer, it follows that there are $2^{2^{2^n}}$
distinguishable Boolean queries (for $n = 3$, about $10^{77}$). If
our goal is to learn any query from $n$ simple propositions
by asking users to label objects as answers or non-answers,
i.e. asking membership questions, then we would have to
distinguish between $2^{2^{2^n}}$ queries using $\Omega(\lg(2^{2^{2^n}}))$ or $2^{2^n}$
questions.
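The counting argument above can be checked numerically. The following is a minimal sketch; the function name `boolean_space_sizes` is ours, not the paper's.

```python
def boolean_space_sizes(n):
    """Return (#tuples, #objects, #queries) for n independent propositions."""
    tuples = 2 ** n            # truth assignments to the n propositions
    objects = 2 ** tuples      # sets of Boolean tuples
    queries = 2 ** objects     # answer/non-answer labelings of all objects
    return tuples, objects, queries

tuples, objects, queries = boolean_space_sizes(3)
print(tuples, objects)             # 8 256
print(len(str(queries)))           # 78 decimal digits, i.e. about 10^77
print(queries.bit_length() - 1)    # 256 = lg(#queries) yes/no answers
```

Each membership question yields one bit, so telling all queries apart takes at least lg(#queries) = 2^(2^n) questions, which is 256 already for n = 3.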

Since this ambitious goal of learning any query with few 
membership questions is doomed to fail, we have to constrain 
the query space. We study the learnability of a special space 
of queries, which we refer to as qhorn. 

2.1 Qhorn 

Qhorn has the following properties: 

1. It supports if-then query semantics via quantified Horn
expressions: $\forall t \in S\ (x_1 \wedge x_2 \rightarrow x_3)$. A Horn expression
has a conjunction of body variables that imply a single
head variable. The degenerate headless Horn expression
is simply a quantified conjunction of body variables
($\exists t \in S\ (x_1 \wedge x_2)$) and the degenerate bodyless Horn expression
is simply a single quantified variable
($\forall t \in S\ (\mathrm{T} \rightarrow x_1) \equiv \forall t \in S\ (x_1)$).

2. It requires at least one positive instance for each Horn
expression via a guarantee clause. Thus, we add the
existential clause $\exists t \in S\ (x_1 \wedge x_2 \wedge x_3)$ to the expression
$\forall t \in S\ (x_1 \wedge x_2 \rightarrow x_3)$ to get a complete query. Note
that the expression $\exists t \in S\ (x_1 \wedge x_2 \rightarrow x_3)$ is implied by
its guarantee clause $\exists t \in S\ (x_1 \wedge x_2 \wedge x_3)$.

We justify the naturalness of guarantee clauses with the 
following example: consider a user looking for a box 
of only sugar-free chocolates. Without the guarantee
clause, an empty box satisfies the user's query. While 
such a result is logical, we contend that most users would 
not consider the result as representative of sugar-free 
chocolate boxes. 

3. It represents queries in a normalized form: conjunctions 
of quantified (Horn) expressions. 

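We use a shorthand notation for queries in qhorn. We
drop the implicit '$t \in S$', the '$\wedge$' symbol and the guarantee
clause. Thus, we write the query

$\forall t \in S\ (x_1 \wedge x_2 \rightarrow x_3) \wedge \exists t \in S\ (x_1 \wedge x_2 \wedge x_3) \wedge$
$\forall t \in S\ (x_4) \wedge \exists t \in S\ (x_4) \wedge \exists t \in S\ (x_5)$

as $\forall x_1 x_2 \rightarrow x_3\ \forall x_4\ \exists x_5$.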
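To make the semantics of the shorthand concrete, here is a minimal sketch of an evaluator that restores the guarantee clauses the shorthand drops. The encoding is our assumption, not the paper's: a Boolean tuple is a string of '0'/'1' characters, and an expression is a triple (quantifier, body, head) with 1-based variable indices.

```python
def holds(expr, obj):
    """expr = (quantifier, body, head): quantifier is "A" or "E", body is a
    tuple of 1-based variable indices, head is an index or None (headless)."""
    quantifier, body, head = expr
    needed = set(body) | ({head} if head is not None else set())
    # Guarantee clause: some tuple makes the body and the head all true.
    guarantee = any(all(t[i - 1] == "1" for i in needed) for t in obj)
    if quantifier == "E":
        return guarantee
    # Universal Horn expression: in every tuple, body true implies head true.
    implication = all(
        t[head - 1] == "1" or not all(t[i - 1] == "1" for i in body) for t in obj
    )
    return implication and guarantee

def is_answer(query, obj):
    return all(holds(expr, obj) for expr in query)

# The shorthand query: forall x1x2 -> x3, forall x4, exists x5
query = [("A", (1, 2), 3), ("A", (), 4), ("E", (5,), None)]
print(is_answer(query, {"11110", "00011"}))  # True
print(is_answer(query, {"11011"}))           # False: x1, x2 true but x3 false
print(is_answer(query, set()))               # False: guarantee clauses fail
```

The last call shows the effect of the guarantee clause: the empty object satisfies every universal expression vacuously, yet it is classified as a non-answer.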
2.1.1 Qhorn's Equivalence Rules

R1 The query representation $\exists x_1 x_2 x_3\ \exists x_1 x_2\ \exists x_2 x_3$ is equivalent
to $\exists x_1 x_2 x_3$. This is because if a set contains a tuple
that satisfies $\exists x_1 x_2 x_3$, that tuple will also satisfy $\exists x_1 x_2$
and $\exists x_2 x_3$. An existential conjunction over a set of variables
dominates any conjunction over a subset of those
variables.

R2 The query representation $\forall x_1 x_2 x_3 \rightarrow h\ \forall x_1 x_2 \rightarrow h\ \forall x_1 \rightarrow h$
is equivalent to $\forall x_1 \rightarrow h\ \exists x_1 x_2 x_3 \rightarrow h$.
This is because $h$ has to be true whenever $x_1$ is true
regardless of the true/false assignment of $x_2, x_3$. Thus
a universal Horn expression with body variables $B$ and
head variable $h$ dominates any universal Horn expression
with body variables $B'$ and head variable $h$ where
$B' \supseteq B$.

R3 The query representation $\forall x_1 \rightarrow h\ \exists x_1 x_3$ is equivalent
to $\forall x_1 \rightarrow h\ \exists x_1 x_3 h$. Again, this equivalence is because
$h$ has to be true whenever $x_1$ is true.
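Because the rules involve only three or four variables, they can be checked by brute force over every possible object. The sketch below assumes an encoding of our own: a Boolean tuple is the frozenset of its true variables, and each universal expression carries its guarantee clause.

```python
from itertools import combinations

def all_tuples(variables):
    """All Boolean tuples over `variables`, each as the frozenset of its true variables."""
    vs = list(variables)
    return [frozenset(c) for r in range(len(vs) + 1) for c in combinations(vs, r)]

def holds(expr, obj):
    quantifier, body, head = expr            # head is None for a conjunction
    body = frozenset(body)
    needed = body | ({head} if head is not None else set())
    guarantee = any(needed <= t for t in obj)
    if quantifier == "E":
        return guarantee
    return guarantee and all(head in t or not body <= t for t in obj)

def equivalent(q1, q2, variables):
    tuples = all_tuples(variables)
    for mask in range(2 ** len(tuples)):     # every possible object
        obj = [t for i, t in enumerate(tuples) if (mask >> i) & 1]
        if all(holds(e, obj) for e in q1) != all(holds(e, obj) for e in q2):
            return False
    return True

X = ("x1", "x2", "x3")
r1 = equivalent([("E", X, None), ("E", ("x1", "x2"), None), ("E", ("x2", "x3"), None)],
                [("E", X, None)], X)
r2 = equivalent([("A", X, "h"), ("A", ("x1", "x2"), "h"), ("A", ("x1",), "h")],
                [("A", ("x1",), "h"), ("E", X, "h")], X + ("h",))
r3 = equivalent([("A", ("x1",), "h"), ("E", ("x1", "x3"), None)],
                [("A", ("x1",), "h"), ("E", ("x1", "x3", "h"), None)], ("x1", "x3", "h"))
print(r1, r2, r3)
```

The check for R2 enumerates all 65,536 objects over four variables, which is feasible only because the variable count is tiny. The same doubly exponential growth is what makes general query learning intractable.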

2.1.2 Learning with Membership Questions

A membership question is simply an object along with 
its nested data tuples. The user responds to such a question 
by classifying the object as an answer or a non-answer for 
their intended query. 

Given a collection of n propositions on the nested relation, 
the learning algorithm constructs a membership question in 
the Boolean domain: a set of Boolean tuples on n Boolean 
variables $x_1, \ldots, x_n$, one variable for each proposition. Such a
set is transformed into an object in the data domain before 
presentation to the user. 

For brevity, we describe a membership question in the 
Boolean domain only. As a notational shorthand, we use $1^n$
to denote a Boolean tuple where all variables are true. We 
use lowercase letters for variables and uppercase letters for 
sets of variables. 

The following definitions describe two structural proper- 
ties of qhorn queries that influence its learnability: 

Definition 2.5. Query size, k, is the number of expres- 
sions in the query. 

Definition 2.6. Causal density, $\theta$, is the maximum
number of distinct non-dominated universal Horn expressions
for a given head variable $h$.

Conceptually, universal Horn expressions represent causa- 
tion: whenever the body variables are true, the head variable 
has to be true. If a head variable has many universal Horn 
expressions, it has many causes for it to be true and thus 
has a high causal density. 

The following inequality between causal density $\theta$ and
query size $k$ holds: $0 \leq \theta \leq k$. We would expect users'
queries to be small in size, $k = O(n)$, and to have low causal
density $\theta$.

A query class is efficiently learnable if (i) the number of 
membership questions that a learning algorithm asks the 
user is polynomial in the number of propositions n and query 
size k and (ii) the learning algorithm runs in time polynomial 
in n and k. Question generation needs to be in polynomial 
time to ensure interactive performance. This requirement 
entails that the number of Boolean tuples per question is 
polynomial in n and k. A query class is exactly learnable if 
we can learn the exact target query that satisfies the user's 
responses to the membership questions. 

Due to the following theorem, qhorn cannot be efficiently
and exactly learned with a tractable number of questions
(even when query size is polynomially bounded in the number
of propositions ($k = n$) and causal density has an upper
bound of one ($\theta = 1$)).

Theorem 2.1. Learning qhorn queries where variables
can repeat $r \geq 2$ times requires $\Omega(2^n)$ questions.

Proof: Suppose we split our $n$ variables into two disjoint
subsets: $X$, $Y$. Consider the query class:

$\phi = \mathrm{Uni}(X) \wedge \mathrm{Alias}(Y)$
$\mathrm{Uni}(X) = \forall x_1\ \forall x_2 \ldots \forall x_{|X|}$
$\mathrm{Alias}(Y) = \forall y_1 \rightarrow y_2\ \forall y_2 \rightarrow y_3 \ldots \forall y_{|Y|} \rightarrow y_1$



$\phi$ is simply the class of qhorn queries where some variables,
$X$, are universally quantified and bodyless and the
other variables $Y$ form an alias, i.e. all variables in $Y$
are either all true or all false. An example instance
from this class of queries over variables $\{x_1, x_2, \ldots, x_6\}$ is:
$\mathrm{Uni}(\{x_1, x_3, x_5\}) \wedge \mathrm{Alias}(\{x_2, x_4, x_6\}) = \forall x_1\ \forall x_3\ \forall x_5 \wedge \forall x_2 \rightarrow x_4\ \forall x_4 \rightarrow x_6\ \forall x_6 \rightarrow x_2$.
Only two questions satisfy the example
instance: (i) a question with only the Boolean tuple
$1^6$ and (ii) a question with tuples $\{1^6, 101010\}$.

There are $2^n$ query instances in the class $\phi$. If we construct
a membership question with only $1^n$ tuples then for
all instances in $\phi$, the question is an answer and we cannot
learn the target query. If we augment the question with
one tuple where some variables are false then all such questions
are non-answers unless exactly the false variables are
the alias variables. Augmenting the question with two or
more tuples will always be a non-answer even if one of the
tuples has exactly and only the alias variables set to false.
This is because the other tuples either have some universally-quantified
bodyless variables that are false or some alias variables
that are true and some alias variables that are false.

This leaves us with $2^n$ membership questions where each
question satisfies exactly one target query. Consider an adversary
who always responds 'non-answer'. In the worst case
we have to ask $2^n - 1$ questions as each question eliminates
exactly one query from consideration as the target query. $\Box$
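The claim that only two objects satisfy the example instance can be verified by enumeration. The sketch below is illustrative: the function name `answers` and the string encoding of tuples are our assumptions. It keeps only the tuples that violate no universal expression, then tests every non-empty set of them against the guarantee clauses.

```python
from itertools import combinations, product

def answers(uni, alias, n):
    """Enumerate every object that is an answer to Uni(uni) AND Alias(alias)."""
    alias = sorted(alias)
    # Universal Horn expressions as (body, head) pairs over 1-based indices.
    exprs = [((), x) for x in uni]
    exprs += [((a,), b) for a, b in zip(alias, alias[1:] + alias[:1])]

    def true(t, i):
        return t[i - 1] == "1"

    def tuple_ok(t):      # t violates no universal expression
        return all(true(t, h) or not all(true(t, b) for b in body) for body, h in exprs)

    def guaranteed(obj):  # each expression has a positive instance in obj
        return all(any(true(t, h) and all(true(t, b) for b in body) for t in obj)
                   for body, h in exprs)

    # An answer can only contain tuples that violate nothing.
    safe = [t for t in map("".join, product("01", repeat=n)) if tuple_ok(t)]
    return [set(obj) for r in range(1, len(safe) + 1)
            for obj in combinations(safe, r) if guaranteed(obj)]

print(answers({1, 3, 5}, {2, 4, 6}, 6))
```

For the example instance the only safe tuples are 111111 and 101010, and the guarantee clauses force 111111 to be present, which leaves exactly the two answers named in the proof.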

Qhorn's intractability does not mean that we cannot construct
efficiently and exactly learnable qhorn subclasses. We
describe two such sub-classes:

2.1.3 Qhorn-1 

Qhorn-1 defines certain syntactic restrictions on qhorn.
Not counting guarantee clauses, if a query has $k$ distinct
expressions ($1 \leq k \leq n$) and each expression $i$ has body
variables $B_i$ and a head variable $h_i$, such that $B = B_1 \cup \ldots \cup B_k$
is the collection of all body variables and $H = \{h_1, \ldots, h_k\}$
is the set of all head variables then the following restrictions
hold in qhorn-1:

1. $B_i \cap B_j = \emptyset \vee B_i = B_j$ if $i \neq j$

2. $h_i \neq h_j$ if $i \neq j$

3. $B \cap H = \emptyset$
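The three restrictions can be checked mechanically. The following is a minimal sketch, assuming a query is given as a list of (body, head) pairs with guarantee clauses and quantifiers omitted; the function name `is_qhorn1` is ours.

```python
def is_qhorn1(expressions):
    """expressions: list of (body, head) pairs, guarantee clauses omitted.
    body is a set of variables (empty for a bodyless expression)."""
    bodies = [frozenset(b) for b, _ in expressions]
    heads = [h for _, h in expressions]
    # 1. Any two bodies are either disjoint or identical.
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            if bodies[i] & bodies[j] and bodies[i] != bodies[j]:
                return False
    # 2. No head variable is repeated.
    if len(set(heads)) != len(heads):
        return False
    # 3. No variable is both a body variable and a head variable.
    all_body = frozenset().union(*bodies)
    return not (all_body & set(heads))

print(is_qhorn1([(set(), 1), (set(), 2), ({3}, 4), ({5, 6}, 7)]))  # True
print(is_qhorn1([({1, 2}, 3), ({2, 4}, 5)]))                       # False
```

The second call fails restriction 1: the bodies {x1, x2} and {x2, x4} overlap without being identical.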

The first restriction ensures that different head variables 
can either share the exact same set of body variables or have 
disjoint bodies. The second restriction ensures that a head 
variable has only one body. Finally, the third restriction 
ensures that a head variable does not reappear as a body 
variable. Effectively, qhorn-1 has no variable repetition: a 
variable can appear once either in a set of body variables 
or as a head variable. The following diagram labels the 
different components of a qhorn-1 query. 

Figure 2: The different components of a qhorn-1 query
(diagram labels: existential Horn expressions, universal Horn expressions).

Note that qhorn-1 queries have a maximum query size
$k$ of $n$ and have a causal density of at most one. From
an information-theoretic perspective, $\Omega(n \lg n)$ membership
questions are required to learn a target query in qhorn-1.
This is because qhorn-1 has $2^{\Theta(n \lg n)}$ queries. We can
think of all head variables that share the same set of body 
variables to be one part of a partition of the n Boolean vari- 
ables. If we can construct a unique qhorn-1 query from every 
partition of n variables, then a lower bound on the number 
of queries is the Bell Number Bn, i.e. the number of ways we 
can partition a set into non-empty, non-overlapping subsets. 
One way to construct a unique query for every partition is 
as follows: 

1. We universally quantify all variables that appear in a
singleton part: $\forall x$.

2. For all other parts, we pick any one variable as the head 
of an existentially quantified Horn expression with the 
remaining variables as body variables. 

For example, we construct the query $\forall x_1\ \forall x_2\ \exists x_3 \rightarrow x_4\ \exists x_5 x_6 \rightarrow x_7$
from the partition $x_1 | x_2 | x_3 x_4 | x_5 x_6 x_7$. Since
$\ln(B_n) = \Theta(n \ln n)$, a lower bound estimate on the number
of queries in qhorn-1 is $2^{\Theta(n \lg n)}$.

Note that for each part, we can have either an existential
or a universal quantifier and we can set a variable's role as
either a head or a body variable. Since we can have at most $n$
parts, an upper bound estimate on the number of queries is
$2^n \times 2^n \times 2^{\Theta(n \lg n)}$. Thus the size of qhorn-1 is $2^{\Theta(n \lg n)}$.
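The partition-to-query construction and the Bell numbers it relies on can be sketched as follows. Computing Bell numbers with the Bell triangle and picking the largest variable of each part as its head are our choices for illustration; the text allows any variable of a part to serve as head.

```python
def bell(n):
    """n-th Bell number, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]              # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

def query_from_partition(parts):
    """Build the qhorn-1 shorthand for a partition of variable indices."""
    out = []
    for part in parts:
        part = sorted(part)
        if len(part) == 1:           # singleton part: universally quantified
            out.append(f"∀x{part[0]}")
        else:                        # last variable is the existential head
            body = "".join(f"x{v}" for v in part[:-1])
            out.append(f"∃{body} → x{part[-1]}")
    return " ".join(out)

print([bell(n) for n in range(8)])   # [1, 1, 2, 5, 15, 52, 203, 877]
print(query_from_partition([[1], [2], [3, 4], [5, 6, 7]]))
# ∀x1 ∀x2 ∃x3 → x4 ∃x5x6 → x7
```

Distinct partitions yield distinct queries under this construction, which is why the Bell number is a lower bound on the number of qhorn-1 queries.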

2.1.4 Role-preserving qhorn 

In role-preserving qhorn queries, variables can repeat
many times, but across universal Horn expressions head variables
can only repeat as head variables and body variables
can only repeat as body variables. For example, the following
query is in role-preserving qhorn

$\forall x_1 x_4 \rightarrow x_5\ \forall x_3 x_4 \rightarrow x_5\ \forall x_2 x_4 \rightarrow x_6\ \exists x_1 x_2 x_3\ \exists x_1 x_2 x_5 x_6$

while the following query is not in role-preserving qhorn

$\forall x_1 x_4 \rightarrow x_5\ \forall x_2 x_3 x_5 \rightarrow x_6$

because $x_5$ appears both as a head variable and a body variable
in two universally quantified Horn expressions. Existential
Horn expressions in role-preserving qhorn are rewritten
as existential conjunctions and variables do not have roles in
these conjunctions. Thus, existential conjunctions can contain
one or more head variables (e.g. $\exists x_1 x_2 x_5 x_6$ in the first
query). The following diagram labels the different components
of a role-preserving qhorn query.

Figure 3: The different components of a role-preserving
qhorn query (diagram labels: non-head variables, body variables,
existential conjunctions, universal Horn expressions).

Both query size and causal density play a role in the behavior
of learning and verification algorithms. Once we remove
the syntactic restriction of variables appearing at most
once, the size of a target query instance is no longer polynomially
bounded in $n$. Thus, the complexity of learning and
verification algorithms for role-preserving qhorn queries is
parameterized by $k$, $\theta$ and $n$. We would expect user queries
to have low causal densities and to be small in size. Provided
that $\theta$ has a constant upper bound, then we can efficiently
learn role-preserving queries.
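The role-preserving condition and causal density (Def. 2.6) can both be computed directly from the universal Horn expressions of a query. The following is a minimal sketch with hypothetical function names. Existential conjunctions are left out because variables have no roles in them.

```python
def is_role_preserving(universal):
    """universal: list of (body, head) pairs for the universal Horn expressions."""
    heads = {h for _, h in universal}
    bodies = set().union(*(set(b) for b, _ in universal))
    return not (heads & bodies)

def causal_density(universal):
    """Largest number of non-dominated bodies attached to one head variable."""
    by_head = {}
    for body, head in universal:
        by_head.setdefault(head, set()).add(frozenset(body))
    density = 0
    for bodies in by_head.values():
        # A body is dominated if a strict subset of it is also a body (rule R2).
        kept = [b for b in bodies if not any(other < b for other in bodies)]
        density = max(density, len(kept))
    return density

first = [({1, 4}, 5), ({3, 4}, 5), ({2, 4}, 6)]
second = [({1, 4}, 5), ({2, 3, 5}, 6)]
print(is_role_preserving(first), causal_density(first))   # True 2
print(is_role_preserving(second))                         # False
```

In the first query the head x5 has two non-dominated bodies, {x1, x4} and {x3, x4}, so its causal density is 2.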

3. QUERY LEARNING 
3.1 Learning qhorn-1

Theorem 3.1. $O(n \lg n)$ questions are sufficient to learn
qhorn-1 queries in polynomial time.

Proof: The learning algorithm breaks down query learning
into a series of small tasks. First, it classifies all variables
into either universal head variables or existential variables
(Fig. 2 describes qhorn-1 terminology). Second, it learns
the body variables (if any) for each universal head variable.
Finally, it learns existential Horn expressions. We show that
each task requires at most $O(n \lg n)$ membership questions
(Section 3.1.1, Lemmas 3.2 and 3.3), thus proving that the
learning algorithm asks $O(n \lg n)$ questions. $\Box$

3.1.1 Learning universal head variables 

The simplest learning task is to determine whether a variable
is a universal head variable. Suppose we have three
variables: $x_1, x_2, x_3$. To determine if $x_1$ is the head of a universal
Horn expression, we ask the user if the set $\{111, 011\}$
is an answer. By setting the other variables ($x_2, x_3$) to be
always true, we are setting all potential body variables of $x_1$
to true. We are also neutralizing the effect of other unknown
head variables on the outcome of a membership question. If
the set $\{111, 011\}$ is an answer, then we are sure that $x_1$
is not a universal head variable because it can exist with a
false value as long as at least one tuple has a true value for
it. If the set is a non-answer, then we learn that $x_1$ is a
universal head variable.

We need one question to determine whether a variable is
a universal head variable and we need $O(n)$ time to generate
each question, the time to construct a set with two tuples
of size $n$. Thus, we learn which variables are universal head
variables, $U$, and which variables are existential variables,
$E$, in polynomial time.
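The classification step can be simulated against a hidden target query. In the sketch below the user is replaced by a function that evaluates a hypothetical target, $\forall x_2 \rightarrow x_1\ \exists x_3$. The tuple encoding ('0'/'1' strings) and all names are our assumptions.

```python
def is_answer(query, obj):
    """query: list of (quantifier, body, head); tuples are '0'/'1' strings."""
    def holds(expr):
        quantifier, body, head = expr
        needed = set(body) | ({head} if head is not None else set())
        guarantee = any(all(t[i - 1] == "1" for i in needed) for t in obj)
        if quantifier == "E":
            return guarantee
        return guarantee and all(
            t[head - 1] == "1" or not all(t[i - 1] == "1" for i in body) for t in obj)
    return all(holds(e) for e in query)

def universal_heads(n, ask):
    """Ask one question {1^n, 1^n with x_i false} per variable x_i."""
    ones = "1" * n
    heads = []
    for i in range(1, n + 1):
        flipped = ones[: i - 1] + "0" + ones[i:]
        if not ask({ones, flipped}):     # non-answer: x_i is a universal head
            heads.append(i)
    return heads

target = [("A", (2,), 1), ("E", (3,), None)]   # hypothetical hidden query
asked = []
def user(obj):
    asked.append(obj)
    return is_answer(target, obj)

print(universal_heads(3, user))   # [1]
print(len(asked))                 # 3
```

Exactly n questions are asked, one per variable, each with two tuples of n bits.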

3.1.2 Learning body variables of universal Horn ex- 
pressions 

Definition 3.1. Given a universal head variable $h$ and a
subset of existential variables $V \subseteq E$, a universal dependence
question on $h$ and $V$ is a membership question with
two tuples: $1^n$ and a tuple where $h$ and $V$ are false and all
other variables are true.



Algorithm 1 Find bodies of universal head variable

h: The universal head variable
E: The set of existential variables
$\mathcal{B}$: The set of bodies learned so far ($B_1, B_2, \ldots$)

b $\leftarrow$ Find(UniversalDependence(h, ?), Non-Answer, $B_1 \cup B_2 \cup \ldots \cup B_{|\mathcal{B}|}$)
if b $\neq \emptyset$ then
  for B $\in \mathcal{B}$ do
    if b $\in$ B then
      return B
    end if
  end for
end if
B $\leftarrow$ FindAll(UniversalDependence(h, ?), Non-Answer, E)
return B



If a universal dependence question on $h$ and $V$ is an answer,
then we learn that a subset of $h$'s body variables is in
$V$. This is because when the conjunction of body variables
is not satisfied, the head variable can be false. We say that
$h$ depends on some variables in $V$. If the question is a non-answer,
then we learn that $h$'s body variables are a subset
of $E - V$; $h$ has no body variables in $V$ because in qhorn-1,
$h$ can have at most one body.

The most straightforward way to learn the body variables,
$B$, of one universal variable is with $O(|E|) = O(n)$ universal
dependence questions: we serially test if $h$ depends on
each variable $e \in E$. This means we use $O(n^2)$ questions to
determine the body variables for all universal variables. We
can do better.

Algorithm 2 Find

Q: The question to ask
r: The response on which we eliminate a set of variables from
   further consideration
D: The variables to apply binary search within

if Ask(Q(D)) = r then
  return $\emptyset$
else
  if |D| = 1 then
    return D
  else
    Split D into $D_1$ (1st half) and $D_2$ (2nd half)
    x $\leftarrow$ Find(Q, r, $D_1$)
    if x = $\emptyset$ then
      return Find(Q, r, $D_2$)
    else
      return x
    end if
  end if
end if



We perform a binary search for $h$'s body variables in $E$. If
$h$ has $B$ body variables, we ask $O(|B| \lg n)$ instead of $O(n)$
questions to determine $B$. Suppose we have four variables
$x_1, x_2, x_3, x_4$ such that $x_1$ is a universal head variable and
all other variables are existential variables. $x_2, x_3, x_4$ are
potential body variables for $x_1$. If the set $\{1^4, 0^4\}$ is a non-answer
then $x_1$ is independent of all other variables and it
has no body. If the set is an answer, we divide and conquer
the variables. We ask if $x_1$ universally depends on half the
variables, $\{x_2, x_3\}$, with the set $\{1^4, 0001\}$. If the set is a
non-answer then we eliminate half the variables, $\{x_2, x_3\}$,
from further consideration as body variables. We know that
a body variable has to exist in the remaining half and since
$x_4$ is the last remaining variable, we learn the expression
$\forall x_4 \rightarrow x_1$. If the set $\{1^4, 0001\}$ is an answer, then we know
at least one body variable exists in $\{x_2, x_3\}$ and we continue
the search for body variables in $\{x_2, x_3\}$, making sure that
we also search the other half $\{x_4\}$ for body variables.

Algorithm 3 FindAll

Q: The question to ask
r: The response on which we eliminate a set of variables from
   further consideration
D: The variables to apply binary search within

if Ask(Q(D)) = r then
  return $\emptyset$
else
  if |D| = 1 then
    return D
  else
    Split D into $D_1$ (1st half) and $D_2$ (2nd half)
    return FindAll(Q, r, $D_1$) $\cup$ FindAll(Q, r, $D_2$)
  end if
end if
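The binary search of Find and FindAll can be sketched in executable form. This is an illustration under our own assumptions, not the authors' implementation: tuples are '0'/'1' strings, the user is simulated by evaluating a hypothetical hidden query, the first half is the larger half when the set has odd size, and an empty search set is treated as eliminated.

```python
def is_answer(query, obj):
    """query: list of (quantifier, body, head); tuples are '0'/'1' strings."""
    def holds(expr):
        quantifier, body, head = expr
        needed = set(body) | ({head} if head is not None else set())
        guarantee = any(all(t[i - 1] == "1" for i in needed) for t in obj)
        if quantifier == "E":
            return guarantee
        return guarantee and all(
            t[head - 1] == "1" or not all(t[i - 1] == "1" for i in body) for t in obj)
    return all(holds(e) for e in query)

def universal_dependence(h, n):
    """Question builder: 1^n plus a tuple where h and V are false."""
    def question(V):
        off = set(V) | {h}
        return {"1" * n, "".join("0" if i in off else "1" for i in range(1, n + 1))}
    return question

def find(question, eliminate, D, ask):
    """Binary search for one variable of D that matters, or None."""
    if not D or ask(question(D)) == eliminate:
        return None
    if len(D) == 1:
        return D[0]
    mid = (len(D) + 1) // 2
    x = find(question, eliminate, D[:mid], ask)
    return x if x is not None else find(question, eliminate, D[mid:], ask)

def find_all(question, eliminate, D, ask):
    """Binary search for every variable of D that matters."""
    if not D or ask(question(D)) == eliminate:
        return set()
    if len(D) == 1:
        return {D[0]}
    mid = (len(D) + 1) // 2
    return (find_all(question, eliminate, D[:mid], ask)
            | find_all(question, eliminate, D[mid:], ask))

NON_ANSWER = False
target = [("A", (4,), 1), ("E", (2,), None), ("E", (3,), None)]  # forall x4 -> x1
ask = lambda obj: is_answer(target, obj)
print(find_all(universal_dependence(1, 4), NON_ANSWER, [2, 3, 4], ask))  # {4}
```

On this target the sketch asks the same two questions as the worked example, {1111, 0000} and {1111, 0001}. It then asks one extra question on {x4}, because it does not apply the shortcut of inferring the last remaining variable.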

Lemma 3.2. $O(n \lg n)$ universal dependence questions are
sufficient to learn the body variables of all universal head
variables.

Proof: Suppose we partition all variables into $m$ non-overlapping
parts of sizes $k_1, k_2, \ldots, k_m$ such that $\sum_{i=1}^{m} k_i = n$.
Each part has at least one body variable and at least one
universal head variable. Such a query class is in qhorn-1 as
all body variables are disjoint across parts and head variables
cannot reappear as head variables for other bodies or
in the bodies of other head variables.

Given a head variable $h_i$, we can determine its body variables
$B_i$ using the binary search strategy above: we ask
$O(|B_i| \lg n)$ questions (it takes $O(\lg n)$ questions to determine
one body variable). For each additional head variable,
$h'_i$, that shares $B_i$, we require at most $\lg n$ questions: we
only need to determine that $h'_i$ has one body variable in
the set $B_i$. Thus to determine all variables and their roles
in a part of size $k_i$ with $|B_i|$ body variables and $|H_i|$ head
variables we need $O(|B_i| \lg n + |H_i| \times \lg n) = O(k_i \lg n)$
questions. Since there are $m$ parts, we ask a total of
$O(\sum_{i=1}^{m} k_i \lg n) = O(n \lg n)$ questions. $\Box$

Since universal dependence questions consist of two tuples
we only need $O(n)$ time to generate each question. Thus, the
overall running time of this subtask is polynomial.

3.1.3 Learning existential Horn expressions 

After learning universal Horn expressions, we have established
some non-overlapping distinct bodies and their universal
head variables. Each variable in the remaining set
of existential variables can either be (i) an existential head
variable of one of the existing bodies, (ii) an existential
head variable of a new body or (iii) a body variable in the
new body. We use existential independence questions to differentiate
between these cases.

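Definition 3.2. Given two disjoint subsets of existential
variables $X \subseteq E$, $Y \subseteq E$, $X \cap Y = \emptyset$, an existential independence
question is a membership question with two
tuples: (i) a tuple where all variables $x \in X$ are false and
all other variables are true and (ii) a tuple where all variables
$y \in Y$ are false and all other variables are true.

Algorithm 4 Learn existential Horn expressions

Q: The target qhorn-1 query
$\mathcal{B}$: The set of bodies learned so far ($B_1, B_2, \ldots$)
E: The set of all existential variables

for e $\in \{E - (B_1 \cup B_2 \cup \ldots \cup B_{|\mathcal{B}|})\}$ do
  b $\leftarrow$ Find(ExistentialIndependence(e, ?), Answer, $B_1 \cup B_2 \cup \ldots \cup B_{|\mathcal{B}|}$)
  if b $\neq \emptyset$ then
    for B $\in \mathcal{B}$ do
      if b $\in$ B then
        Q $\leftarrow$ Q $\wedge\ \exists B \rightarrow e$
      end if
    end for
  else
    D $\leftarrow$ FindAll(ExistentialIndependence(e, ?), Answer, E)
    H $\leftarrow$ GetHead(e, D)
    if H = $\emptyset$ then
      Q $\leftarrow$ Q $\wedge\ \exists D \rightarrow e$
      $\mathcal{B} \leftarrow \mathcal{B} \cup \{D\}$
    else
      h $\leftarrow$ H[1]
      for d $\in \{D - H\}$ do
        if Ask(ExistentialIndependence(h, d)) then
          H $\leftarrow$ H $\cup$ d
        end if
      end for
      B $\leftarrow (D - H) \cup \{e\}$
      $\mathcal{B} \leftarrow \mathcal{B} \cup \{B\}$
      for h $\in$ H do
        Q $\leftarrow$ Q $\wedge\ \exists B \rightarrow h$
      end for
    end if
    E $\leftarrow$ E $-$ D
  end if
end for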

If an independence question between two existential variables
$x$ and $y$ is an answer then either:

1. $x$ and $y$ are existential head variables of the same body

2. or $x$ and $y$ are not in the same Horn expression.

We say that $x$ and $y$ are independent of each other. Two
sets $X$ and $Y$ are independent of each other if all variables
$x \in X$ are independent of all variables $y \in Y$. Conversely, if
an independence question between $x$ and $y$ is a non-answer
then either:

1. $x$ and $y$ are body variables in the same body or

2. $y$ is an existential head variable and $x$ is in its body or

3. $x$ is an existential head variable and $y$ is in its body.

We say that $x$ and $y$ depend on each other. If sets $X$ and
$Y$ depend on each other then at least one variable $x \in X$
depends on one variable $y \in Y$.
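The two lists of cases can be illustrated on a small hypothetical target, $\exists x_1 x_2 \rightarrow x_3\ \exists x_1 x_2 \rightarrow x_4\ \exists x_5$. The sketch below assumes our own encoding: tuples are '0'/'1' strings, and an existential Horn expression, together with its guarantee clause, is satisfied when some tuple makes its body and head all true.

```python
def satisfied(query, obj):
    """query: existential Horn expressions as (body, head) pairs."""
    return all(any(all(t[i - 1] == "1" for i in set(body) | {head}) for t in obj)
               for body, head in query)

def existential_independence(X, Y, n):
    """Two tuples: one with X false, one with Y false, all else true."""
    def tuple_without(off):
        return "".join("0" if i in off else "1" for i in range(1, n + 1))
    return {tuple_without(set(X)), tuple_without(set(Y))}

# Hypothetical target: exists x1x2 -> x3, exists x1x2 -> x4, exists x5
target = [((1, 2), 3), ((1, 2), 4), ((), 5)]

def independent(x, y):
    return satisfied(target, existential_independence({x}, {y}, 5))

print(independent(3, 4))  # True: two heads of the same body
print(independent(1, 5))  # True: not in the same Horn expression
print(independent(1, 2))  # False: body variables of the same body
print(independent(1, 3))  # False: x3 is a head and x1 is in its body
```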

Given an existential variable e, if we discover that e de- 
pends on a body variable 6 of a known set of body variables 
B, then we learn that e is an existential head variable in the 
Horn expression: 3B ^ e. 

Otherwise, we find all existential variables D that e de- 
pends on. We can find all such variables with 0{\D\\gn) 
existential independence questions using the binary search 
strategy of Section |3.1.2| 

Knowing that D depends on e only tells us that one of
the following holds: (i) a subset H of D are existential
head variables for the body e ∪ (D − H), or (ii) e is a
head variable and D is a body. To differentiate between
the two possibilities we make use of the following rule: if
two variables x, y depend on z but x and y are independent,
then z is a body variable and x, y are head variables. If we
find a pair of independent variables h₁, h₂ in D, we learn
that x must be a body variable. If we do not find a pair of
independent variables in D then we may assume that x is
an existential head variable and all variables in D are body
variables.

After finding head variables in D, we can determine the
roles of the remaining variables in D with |D| = O(n) independence
questions between h₁ and each variable d ∈ D − {h₁}.
If h₁ and d are independent then d is an existential head
variable, otherwise d is a body variable.

Our goal, therefore, is to locate a definitive existential
head variable in D by searching for an independent pair of
variables.

Definition 3.3. An independence matrix question
on D variables consists of |D| tuples. For each variable
d ∈ D, there is one tuple in the question where d is false
and all other variables are true.

Suppose we have four variables x₁, ..., x₄; D = {x₂, x₃, x₄}
and D depends on x₁. {1011, 1101, 1110} is a matrix question
on D. If such a question is an answer then there is
at least a pair of head variables in D: the question will always
contain a pair of tuples that ensure that each head and
the body is true. For example, if x₂, x₄ are head variables
then the tuples {1011, 1110} in the question satisfy the Horn
expressions ∃x₁x₃ → x₂, ∃x₁x₃ → x₄. If at most one variable
in D is a head variable, then there is no tuple in the
matrix question where all body variables are true and the
head variable is true, and the question is a non-answer. For
example, if only x₄ is a head variable, then the tuple 1111
that satisfies the Horn expression ∃x₁x₂x₃ → x₄ is absent
from the question.
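The four-variable example can be replayed with a small sketch. The helper names `matrix_question` and `is_answer` are hypothetical, and the oracle is a simplifying assumption: it only checks that, for every head variable, some tuple makes the body and that head true together (the role of the guarantee clauses), which is enough for this example.

```python
def matrix_question(n, dependents):
    """Build one tuple per variable d in `dependents`: d is false, all else true.

    Variables are numbered 1..n and a tuple is a string of '0'/'1' characters.
    """
    return ["".join("0" if i == d else "1" for i in range(1, n + 1))
            for d in sorted(dependents)]


def is_answer(question, body, heads):
    """Assumed oracle: each head needs a tuple where body and head are all true."""
    def all_true(t, variables):
        return all(t[v - 1] == "1" for v in variables)
    return all(any(all_true(t, body | {h}) for t in question) for h in heads)


q = matrix_question(4, {2, 3, 4})
print(q)                                         # ['1011', '1101', '1110']
print(is_answer(q, body={1, 3}, heads={2, 4}))   # True: two heads in D
print(is_answer(q, body={1, 2, 3}, heads={4}))   # False: 1111 is absent
```

The second call returns False because the only tuple that would satisfy the single-head expression, 1111, is never part of a matrix question.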

Lemma 3.3. Given an existential variable x and its dependents
D, we can find an existential head variable in D
with O(|D| lg |D|) independence matrix questions of O(|D|)
tuples each if at least two head variables exist in D.

Proof. Consider the 'GetHead' procedure in Alg. 5 that
finds an existential head variable in the set D of dependents
of variable x. The central idea behind the 'GetHead' procedure
is that if the user responds that a matrix question on D₁
(D₁ ⊂ D) is an answer, then a pair of head variables must
exist in D₁ and we can eliminate the remaining variables
D − D₁ from further consideration. Otherwise, we know that
at most one head variable exists in D₁ and another exists in
D − D₁, so we can eliminate D₁ from further consideration
and focus on finding the head variable in D − D₁.

Each membership question eliminates half the variables
from further consideration as head variables. Thus, we require
only O(lg |D|) = O(lg n) questions to pinpoint one
head variable.

Then, we ask O(|D|) questions to differentiate head from
body variables in D. If we do not find head variables in D
then we may assume that x is a head variable and all variables
in D are body variables. Once we learn one existential
Horn expression, we process the remaining existential variables
in E. If a variable depends on any one of the body
variables, B, of a learned existential Horn expression, it is a
head variable to all body variables in B.



Algorithm 5 GetHead

x: an existential variable
D: the dependents of x, |D| > 1
D₁ ← D, D₂ ← ∅, D₃ ← ∅
while D₁ ≠ ∅ do
    isAnswer ← Ask(MatrixQuestion(x, D₁))
    if isAnswer then
        if |D₁| = 2 ∧ D₂ = ∅ then return D₁
        else if |D₁| > 2 ∧ D₂ = ∅ then
            Split D₁ into D₁ (1st half) and D₃ (2nd half)
        else if |D₂| = 1 then return D₂
        else
            Split D₂ into D₂ (1st half) and D₃ (2nd half)
            D₁ ← D₁ − D₃
        end if
    else
        if D₃ = ∅ then return ∅
        else if |D₃| = 1 then return D₃
        else
            Split D₃ into D₂ (1st half) and D₃ (2nd half)
            D₁ ← D₁ ∪ D₂
        end if
    end if
end while



Suppose a query has m distinct existential expressions
with k₁, ..., k_m variables each, then Σ_{i=1..m} kᵢ ≤ n. The
size of each set of dependent variables for each expression
i is kᵢ − 1. So the total number of questions we ask is

\[ \sum_{i=1}^{m} \big( O(k_i \lg n) + O(\lg k_i) + O(k_i) \big) = O(n \lg n) \]

Note, however, that each matrix question has O(|D|) =
O(n) tuples of n variables each and therefore requires O(n²)
time to generate. If we limit the number of tuples per question
to a constant number, then we increase the number of
questions asked to Ω(n²).

Lemma 3.4. Ω(n²) membership questions, with a constant
number of tuples each, are required to learn existential
expressions.

Proof: Consider the class of queries on n variables such
that all variables in the set X − {xᵢ, xⱼ} are body variables
and the pair xᵢ, xⱼ are head variables. Thus, the target class
for our learning algorithm is the set of queries of this form:

\[ \exists C_{ij} \to x_i \;\land\; \exists (C_{ij} \land x_i) \;\land\; \exists C_{ij} \to x_j \;\land\; \exists (C_{ij} \land x_j) \]

where Cᵢⱼ = X − {xᵢ, xⱼ} and ∃(Cᵢⱼ ∧ xᵢ), ∃(Cᵢⱼ ∧ xⱼ) are
guarantee clauses.

Any algorithm that learns such a query needs to determine
exactly which of the possible (n choose 2) pairs of variables is
the head variable pair. If we only have a constant number
c of tuples to construct a question with, we need to choose
tuples that provide us with the most information.

We can classify tuples as follows:

Class 1: Tuples where all variables are true. Any question
with such a tuple provides us with no information as the user
will always respond that the set is an answer, regardless of
the variable assignments in the other tuples.

Class 2: Tuples where one variable is false. A question
with one such tuple is always a non-answer as there are no
tuples that satisfy ∃(Cᵢⱼ ∧ xᵢ) ∧ ∃(Cᵢⱼ ∧ xⱼ). We denote a
tuple where xᵢ is false as Tᵢ. A question with two or more
class-2 tuples is an answer if it has two tuples Tᵢ, Tⱼ such
that xᵢ, xⱼ are the head variables. If such a question is a
non-answer, then for all the tuples Tᵢ, Tⱼ in the question,
the pair of variables xᵢ, xⱼ are not the pair of head variables
in the target query.

Class 3: Tuples where more than one variable is false. If
a question has only class-3 tuples, then it will always be
a non-answer. We denote a tuple where at least xᵢ, xⱼ are
false as Tᵢⱼ. If xᵢ, xⱼ are head variables then we need two
more tuples Tᵢ and Tⱼ for the question to be an answer. We
cannot augment the question with more class-3 tuples to
change it to an answer. Any other class-3 tuple that differs
from Tᵢⱼ is bound to have at least one of its false variables
be a body variable: this makes Cᵢⱼ false, thus violating the
clause ∃(Cᵢⱼ ∧ xᵢ) ∧ ∃(Cᵢⱼ ∧ xⱼ). So questions with only
class-3 tuples will always be non-answers and we improve
the information gain of a question with some class-3 tuples
by replacing those tuples with class-2 tuples.

Therefore, with c tuples per question, we gain the most
information from questions with only class-2 tuples. With
c > 2, we construct a question with c tuples such that for
each variable xᵢ ∈ H, H ⊂ X and |H| = c, the question
contains tuple Tᵢ. If the question is an answer, then we have
narrowed our search for the pair of head variables to the set
H. If the question is a non-answer, then we are sure that
all pairs of xᵢ, xⱼ variables with tuples Tᵢ, Tⱼ in the question
do not form the pair of head variables, so we have eliminated
(c choose 2) pairs from consideration as head variables. An
adversary will always respond to such questions with
'non-answer'. In the worst case, we have to ask

\[ \binom{n}{2} \Big/ \binom{c}{2} = \Omega(n^2) \]

questions. □

A tuple has an upset and a downset. These are visually
illustrated in Fig. 4. If a tuple is not in the upset or downset
of another tuple, then these two tuples are incomparable.



3.2 Learning role-preserving qhorn 

Since some queries are more complex than others within
the role-preserving qhorn query class, it is natural to allow
our learning algorithm more time, more questions and more
tuples per question to learn the more complex target queries.
One can argue that such a powerful learning algorithm may
not be practical or usable as it may ask many questions with
many tuples each. If we assume that user queries tend to be
simple (i.e. they are small in size k (Def. 2.5) and have low
causal densities θ (Def. 2.6)), then such an algorithm can be
effective in the general case.

Role-preserving qhorn queries contain two types of expressions:
universal Horn expressions (∀x₁x₂... → h) and
existential conjunctions (∃x₁x₂...) (Fig. 3 describes
role-preserving qhorn terminology). In this section, we show that
we can learn all universal Horn expressions with O(n^(θ+1))
questions and all existential conjunctions with O(kn lg n)
questions. We show lower bounds of Ω(n^(θ−1)) for learning
universal Horn expressions and Ω(nk) for learning existential
conjunctions. Since run-time is polynomial in the number of
questions asked, our run-time is poly(n^θ) for universal Horn
expressions and poly(nk) for existential conjunctions. By
setting a constant upper limit on the causal density of a head
variable we can learn role-preserving qhorn queries in
poly(nk) time.

We employ a Boolean lattice on the n variables of a
query to learn the query's expressions. Fig. 4 illustrates
the Boolean lattice and its key properties. Each point in
the lattice is a tuple of true or false assignments to the
variables. A lattice has n + 1 levels. Each level l, starting from
level 0, consists of tuples where exactly l variables are false.
A tuple's children are generated by setting exactly one of the
true variables to false. Tuples at level l have an out-degree of
n − l, i.e. they have n − l children, and an in-degree of l,
i.e. they have l parents.
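The lattice structure described above can be sketched in a few lines. Tuples are encoded as bit-strings (an encoding chosen here for illustration), and the function names are hypothetical.

```python
def level(t):
    """Level of a tuple in the lattice: the number of false variables."""
    return t.count("0")


def children(t):
    """Tuples obtained by setting exactly one true variable to false."""
    return [t[:i] + "0" + t[i + 1:] for i, bit in enumerate(t) if bit == "1"]


def parents(t):
    """Tuples obtained by setting exactly one false variable to true."""
    return [t[:i] + "1" + t[i + 1:] for i, bit in enumerate(t) if bit == "0"]


def in_upset(a, b):
    """True if a is in the upset of b: every variable true in b is true in a."""
    return all(x == "1" for x, y in zip(a, b) if y == "1")


t = "1011"                  # four variables, x2 is false
print(level(t))             # 1
print(children(t))          # ['0011', '1001', '1010']
print(parents(t))           # ['1111']
print(in_upset("1111", t))  # True
```

For n = 4 and l = 1 the tuple has n − l = 3 children and l = 1 parent, matching the degrees stated in the text. Two tuples are incomparable when `in_upset` is False in both directions.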




Figure 4: The Boolean lattice on four variables. 

The gist of our lattice-based learning algorithms is as follows:

1. We map each tuple in the lattice to a distinct expression.
This mapping respects a certain generality ordering of
expressions. For example, the lattice we use to learn
existential conjunctions maps the top tuple in the lattice
to the most specific conjunction ∃x₁x₂...xₙ; tuples in
the level above the bottom of the lattice map to the
more general conjunctions ∃x₁, ∃x₂, ..., ∃xₙ (§3.2.2).
The exact details of this mapping for learning universal
Horn expressions and learning existential conjunctions
are described in the following sections.

2. We search the lattice in a top-to-bottom fashion for the
tuple that distinguishes or maps to the target query
expression. The learning algorithm generates membership
questions from the tuples of the lattice and the user's
responses to these questions either prune the lattice or
guide the search.

3.2.1 Learning universal Horn expressions 

We first determine head variables of universal Horn
expressions. We use the same algorithm of (§3.1.1). The
algorithm uses O(n) questions. We then determine bodyless
head variables. To determine if h is bodyless, we construct
a question with two tuples: 1ⁿ and a tuple where h and all
existential variables are false and all other variables are true.
If the question is a non-answer then h is bodyless. If h is not
bodyless then we utilize a special lattice (Fig. 5) to learn h's
different bodies. In this lattice, we neutralize the effect of
other head variables by fixing their value to true and we fix
the value of h to false.

Definition 3.4. A universal Horn expression for a given
head variable h is distinguished by a tuple if the true
variables of the tuple represent a complete body for h.

Thus, each tuple in the lattice distinguishes a unique
universal Horn expression. For example, consider the target
query:

∀x₁x₄ → x₅   ∀x₃x₄ → x₅   ∀x₁x₂ → x₆
∃x₁x₂x₃   ∃x₂x₃x₄   ∃x₁x₂x₅   ∃x₂x₃x₅x₆

In the target query, the head variable x₅ has two universal
Horn expressions:

∀x₁x₄ → x₅   ∀x₃x₄ → x₅

In Fig. 5 we marked the two tuples that distinguish x₅'s
universal Horn expressions: 100101 and 001101. Notice that the
universal Horn expressions are ordered from most to least
specific. For example, the top tuple of the lattice in Fig. 5 is
the distinguishing tuple for the expression ∀x₁x₂x₃x₄ → x₅,
while the bottom tuple is the distinguishing tuple for the
expression ∀x₅. Our learning algorithm searches for
distinguishing tuples of only dominant universal Horn expressions.

A membership question with a distinguishing tuple and
the all-true tuple (a tuple where all variables are true) is
a non-answer for one reason only: it violates the universal
Horn expression it distinguishes. This is because the all-true
tuple satisfies all the other expressions in the target query
and the distinguishing tuple sets a complete set of body
variables to true but the head to false. More importantly,
all such membership questions constructed from tuples in
the upset of the distinguishing tuple are non-answers and
all questions constructed from tuples in the downset of the
distinguishing tuple are answers. Thus, the key idea behind
the learning algorithm is to efficiently search the lattice to
find a tuple where questions constructed from tuples in the
upset are non-answers and questions constructed from tuples
in the downset are answers.



Figure 5: Learning bodies for a given head variable
(Figure annotations: questions constructed from the upset of a
distinguishing tuple are non-answers; the top tuple 111101
distinguishes ∀x₁x₂x₃x₄ → x₅; the tuples for ∀x₃x₄ → x₅ and
∀x₁x₄ → x₅ are marked; search roots for more bodies for x₅.)

Given a head variable h and n (non-head) variables, we
use a Boolean lattice on n variables (with h set to false and all
other head variables set to true). We construct a membership
question with a tuple t from the lattice and the all-true
tuple, i.e. a tuple where all variables, including the head
variable, are true. We begin by describing how we can use the
lattice to find just one set of body variables that determine
h with O(n) questions. We start at the top of the lattice,
we construct a question from the top tuple and proceed as
follows:

1. If the question is an answer, then it does not contain an
entire set of body variables that determine h. We prune
its downset. We move to the next tuple on the same level
of the lattice.

2. If the question is a non-answer then some of the true
variables in t form a body and we move down the lattice
(skipping previously pruned tuples). If all of t's children
are answers, then t is a universal distinguishing tuple for
the head variable h.

This lattice-based algorithm is equivalent to the simple
procedure listed in Algorithm 6.

Algorithm 6 Learn one body for head variable h

h: the head variable we are learning a body for
N: the set of all non-head variables
H: the set of all head variables
X ← ∅    ▷ X holds all variables that are not in the body.
for x ∈ N do
    SetTuple(t₀, N ∪ H, T)    ▷ This is the all-true tuple.
    t₁ ← t₀
    SetTuple(t₁, X ∪ {x, h}, F)
    isAnswer ← Ask({t₀, t₁})
    if not isAnswer then
        X ← X ∪ {x}
    end if
end for
return N − X
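A minimal runnable sketch of Algorithm 6 follows. The user is replaced by a simulated oracle built from a known list of bodies; `make_oracle` and `learn_one_body` are hypothetical names, and the oracle assumes that the all-true tuple satisfies every other expression, so a question is a non-answer exactly when some body of h stays fully true while h is false.

```python
def make_oracle(bodies):
    """Simulated user for questions {all-true tuple, t1}.

    `false_vars` is the set of variables that are false in t1 (it always
    contains the head h). The question is a non-answer iff some body of h
    has no false variable, i.e. the Horn expression is violated.
    """
    def ask(false_vars):
        violated = any(not (body & false_vars) for body in bodies)
        return not violated
    return ask


def learn_one_body(h, non_heads, ask):
    """Algorithm 6: one question per non-head variable."""
    excluded = set()                       # X in the pseudocode
    for x in non_heads:
        if not ask(excluded | {x, h}):     # non-answer: x is not needed
            excluded.add(x)
    return set(non_heads) - excluded


# Bodies of x5 in the running example: x1x4 -> x5 and x3x4 -> x5.
ask = make_oracle([{1, 4}, {3, 4}])
print(learn_one_body(5, [1, 2, 3, 4], ask))   # {3, 4}
```

Which body is returned depends on the order in which the non-head variables are visited; here x₁ is discarded first, so the body x₃x₄ survives.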

Once we find a body, we can safely eliminate its upset.
Any body in the upset is dominated (Rule 2) by the discovered
distinguishing tuple. Looking at Fig. 5 we notice that
the upset simply contains all tuples where all body variables
of the distinguishing tuple are true. The remaining lattice
structure is rooted at tuples where one of the body variables
is false. Since two incomparable bodies need to differ on at
least one body variable, we set one body variable to false
and search the resulting sub-lattices for bodies.

Theorem 3.5. O(n^θ) membership questions, where θ is
the causal density of the given head variable h, are sufficient
to learn the universal Horn expressions of h.

Proof: Let bᵢ denote the number of body variables for each
distinguishing tuple tᵢ found. Initially we set b₀ to n and we
search the entire lattice or the n sub-lattices rooted at the
tuples where exactly one Boolean variable is false. In Fig. 5
those are the tuples at level 1: {011101, 101101, 110101,
111001}.

If the first distinguishing tuple found has |B₁| true variables,
then we need to search |B₁| sub-lattices for bodies.
For example, after finding the distinguishing tuple 001101,
we continue searching for more distinguishing tuples from
|B₁| = 2 roots: {110101, 111001}.

Suppose we find a second distinguishing tuple: 100101
with |B₂| body variables; then we need to search for
more bodies in the sub-lattices rooted at tuples where
one body variable from each of the distinct bodies
is set to false. Our new |B₁| × |B₂| roots are:
{010101, 011001, 101001, 111001}. These search roots are
illustrated in Fig. 5.

In the worst case, we ask O(n) questions to find a body.
Thus to determine all θ expressions for a universal head
variable, an upper bound on the number of questions, Q, is:

\[ Q \le (n) + (|B_1| + n) + (|B_1| \times |B_2| + n) + \dots + (|B_1| \times |B_2| \times \dots \times |B_\theta|) \]

\[ Q \le n\theta + \sum_{b=1}^{\theta} \Big( \prod_{i=1}^{b} |B_i| \Big) \le n\theta + \sum_{i=1}^{\theta} n^i = O(n^\theta) \qquad \square \]

Since there are O(n) head variables and for each head
variable we ask O(n^θ) questions to determine its universal
Horn expressions, we learn all universal Horn expressions
with O(n × n^θ) = O(n^(θ+1)) questions.



Theorem 3.6. Ω((n/θ)^(θ−1)) membership questions, where
θ is the causal density of h, are required to learn the
universal Horn expressions of h.

Consider the class of role-preserving queries with θ universal
Horn expressions and n body variables where each
expression Cᵢ for 1 ≤ i < θ consists of n/(θ−1) body variables,
Bᵢ, that determine h. The expression C_θ consists of n − (θ − 1)
body variables, B_θ, such that |B_θ ∩ Bᵢ| = n/(θ−1) − 1.

The following is an example instance with n = 12 body
variables and θ = 4 expressions:

∀x₁x₃x₅x₉ → h   ∀x₂x₄x₆x₁₀ → h   ∀x₇x₈x₁₁x₁₂ → h
∀x₁x₂x₃x₄x₇x₈x₉x₁₀x₁₁ → h



If we construct a tuple where two or more variables from
each body Bᵢ and the head variable are false, then all θ − 1
bodies and B_θ are not satisfied and a question consisting of
such a tuple and the all-true tuple (1ⁿ⁺¹) will always be an
answer.

Alternatively, if we set all of the variables of one body Bᵢ to
true and the head variable to false then a question with such
a tuple will always be a non-answer. Therefore, we can only
set exactly one variable from each body to false to learn B_θ.
There are n/(θ−1) choices per body for which body variable to
set to false, leaving us with (n/(θ−1))^(θ−1) possible questions.
If a question is an answer, we eliminate only one combination of
body variables for B_θ and if the question is a non-answer
then B_θ consists of all body variables that are true. In the
worst case the user responds that each question is an answer,
forcing the algorithm to ask (n/(θ−1))^(θ−1) − 1 = Ω((n/θ)^(θ−1))
questions. □
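The counting argument can be checked on the example instance with n = 12 and θ = 4. The sketch below enumerates every tuple that sets exactly one variable of each of B₁, B₂, B₃ (and h) to false. The oracle `is_answer` is a simulation written for this illustration: it assumes the accompanying all-true tuple satisfies all guarantee clauses, so only universal Horn violations matter.

```python
from itertools import product

# Bodies of the example instance; the last one plays the role of B_theta.
bodies = [{1, 3, 5, 9}, {2, 4, 6, 10}, {7, 8, 11, 12}]
b_theta = {1, 2, 3, 4, 7, 8, 9, 10, 11}


def is_answer(false_vars):
    """h is false in the tuple, so the question is an answer only if every
    body (including B_theta) contains at least one false variable."""
    return all(body & false_vars for body in bodies + [b_theta])


# One false variable per body: (n / (theta - 1)) ** (theta - 1) candidates.
candidates = [frozenset(choice) for choice in product(*bodies)]
non_answers = [c for c in candidates if not is_answer(c)]

print(len(candidates))                    # 64
print([sorted(c) for c in non_answers])   # [[5, 6, 12]]
```

Exactly one of the 64 candidate questions is a non-answer, the one whose true variables are exactly B_θ, so an adversarial user can answer 'answer' to the other 63.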

3.2.2 Learning existential conjunctions 

To learn existential conjunctions of a query we use the full
Boolean lattice on all n variables of a query (including head
variables).

Definition 3.5. An existential conjunction C is distinguished
by a tuple if the true variables of the tuple are the
variables of the conjunction.

Thus, each tuple in the lattice distinguishes a unique
existential conjunction. For example, consider the target query:

∀x₁x₄ → x₅   ∀x₃x₄ → x₅   ∀x₁x₂ → x₆
∃x₁x₂x₃   ∃x₂x₃x₄   ∃x₁x₂x₅   ∃x₂x₃x₅x₆

The conjunction ∃x₂x₃x₅x₆ is distinguished by the tuple
011011 in a six-variable Boolean lattice.

Existential conjunctions are ordered from most to least
specific on the lattice. For example, the top tuple
111111 of a six-variable lattice is the distinguishing
tuple for the expression ∃x₁x₂x₃x₄x₅x₆; the tuples
{000001, 000010, 000100, 001000, 010000, 100000} at level five
of the lattice are the distinguishing tuples for the expressions
∃x₆, ∃x₅, ∃x₄, ∃x₃, ∃x₂, ∃x₁ respectively.

Our learning algorithm searches for distinguishing tuples
of a normalized target query. For example, the target query
above is normalized to the following semantically equivalent
query using (Rule 3):

∀x₁x₄ → x₅   ∀x₃x₄ → x₅   ∀x₁x₂ → x₆
∃x₁x₂x₃x₆   ∃x₂x₃x₄x₅   ∃x₁x₂x₅x₆   ∃x₂x₃x₅x₆        (2)

This query has the following dominant conjunctions (which
include guarantee clauses):

∃x₁x₄x₅   ∃x₁x₂x₃x₆   ∃x₂x₃x₄x₅   ∃x₁x₂x₅x₆   ∃x₂x₃x₅x₆

A membership question with all dominant distinguishing
tuples of a query is an answer: all existential
conjunctions (including guarantee clauses) are
satisfied. For example, a question with the tuples
{100110, 111001, 011110, 110011, 011011} is an answer for
the target query above (2).

Replacing a distinguishing tuple with its children results
in a non-answer: the existential conjunction of that tuple is
no longer satisfied. For example, replacing 011011 with its
children {001011, 010011, 011001, 011010} results in a
membership question where none of the tuples satisfy the
expression ∃x₂x₃x₅x₆.

Replacing a distinguishing tuple with any tuple in its up- 
set that does not violate a universal Horn expression still 
results in an answer. 
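These three observations can be checked against the normalized target query (2) with a simulated user. The oracle below is a sketch under the usual qhorn reading (every tuple must respect each universal Horn expression, and each existential conjunction and guarantee clause needs at least one satisfying tuple); the names `UNIVERSAL`, `EXISTENTIAL` and `is_answer` are introduced here for illustration only.

```python
# Normalized target query (2): (body, head) pairs and existential conjunctions.
UNIVERSAL = [({1, 4}, 5), ({3, 4}, 5), ({1, 2}, 6)]
EXISTENTIAL = [{1, 2, 3, 6}, {2, 3, 4, 5}, {1, 2, 5, 6}, {2, 3, 5, 6}]


def true_vars(t):
    """Set of variable indices (1-based) that are true in bit-string t."""
    return {i + 1 for i, bit in enumerate(t) if bit == "1"}


def is_answer(question):
    tuples = [true_vars(t) for t in question]
    # No tuple may have a full body true while the head is false.
    for body, head in UNIVERSAL:
        if any(body <= t and head not in t for t in tuples):
            return False
    # Each conjunction and each guarantee clause needs a satisfying tuple.
    required = EXISTENTIAL + [body | {head} for body, head in UNIVERSAL]
    return all(any(c <= t for t in tuples) for c in required)


dominant = ["100110", "111001", "011110", "110011", "011011"]
with_children = dominant[:4] + ["001011", "010011", "011001", "011010"]
with_upset = dominant[:4] + ["111011"]   # 111011 is in the upset of 011011

print(is_answer(dominant))        # True
print(is_answer(with_children))   # False
print(is_answer(with_upset))      # True
```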

Thus, the learning algorithm searches level-by-level from
top to bottom for distinguishing tuples by detecting a
change in the user's response to a membership question from
answer to non-answer. The efficiency of the learning
algorithm stems from pruning: when we replace a tuple with its
children, we prune those down to a minimal set of tuples
that still dominate all the distinguishing tuples.

We describe the learning algorithm (Alg. 7) with an example
and then prove that the learning algorithm runs in
O(kn lg n) time (Theorem 3.8). We also prove the algorithm's
correctness (Theorem 3.7) and provide a lower bound of Ω(nk)
for learning existential conjunctions (Theorem 3.9).

Algorithm 7 Find Existential Distinguishing Tuples

T ← {1ⁿ}    ▷ The top tuple.
D ← {}    ▷ D is the set of discovered distinguishing tuples.
while T ≠ ∅ do
    T′ ← {}    ▷ T′ collects the tuples kept for the next level.
    for t ∈ T do
        C ← Children(t)
        C ← RemoveUniversalHornViolations(C)
        T ← T − {t}
        isAnswer ← Ask(D ∪ T ∪ C ∪ T′)
        if isAnswer then
            T′ ← T′ ∪ Prune(C, T ∪ D)
        else
            D ← D ∪ {t}
        end if
    end for
    T ← T′
end while
return D



Suppose we wish to learn the existential conjunctions
of the target query listed in (2). We use the six-variable
Boolean lattice with the following modification: we remove
all tuples that violate a universal Horn expression. These
are tuples where the body variables of a universal Horn
expression are true and the head variable is false. For example,
the tuple 111110 violates ∀x₁x₂ → x₆ and is therefore removed
from the lattice.

Level 1: We start at the top of the lattice. Since the tuple
111111 will satisfy any query, we skip to level one. We now
construct a membership question with all the tuples of level
1 (after removing the tuples that violate universal Horn
expressions: 111110, 111101): 111011, 110111, 101111, 011111.



Algorithm 8 Prune

T: the tuples to prune
O: other tuples
K ← {}    ▷ K is the set of tuples to keep.
Split T into T₁ (1st half) and T₂ (2nd half).
while T₁ ∪ T₂ ≠ ∅ do
    isAnswer ← Ask(T₁ ∪ K ∪ O)
    if isAnswer then
        Split T₁ into T₁ (1st half) and T₂ (2nd half).
    else
        if |T₂| = 1 then
            K ← K ∪ T₂
        else
            Add 1st half of T₂ to T₁. Set T₂ to 2nd half of T₂.
        end if
    end if
end while
return K



If such a question is a non-answer, then the distinguishing
tuple is one level above and the target query has one dominant
existential conjunction: ∃x₁x₂x₃x₄x₅x₆.



[Figure: the children of 111111. Pruned set of tuples: after pruning
the children of 111111, these tuples remain; they dominate all
distinguishing tuples. Tuples that violate universal Horn expressions
are removed from the lattice.]

If the question is an answer, we need to search for tuples
we can safely prune. So we remove one tuple from the question
set and test its membership. Suppose we prune the
tuple 110111: the question is still an answer since all
conjunctions of the target query are still satisfied, i.e. the
remaining set of tuples still dominates the distinguishing
tuples of the target query.

We then prune 011111. This question is a non-answer
since no tuple satisfies the clause ∃x₂x₃x₄x₅. We put 011111
back in and continue searching at level one for tuples to
prune. We are left with the tuples: 111011, 101111 and
011111. Note that we asked O(n) questions to determine
which tuples to safely prune. We can do better. In particular,
we only need O(lg n) questions for each tuple we need
to keep if we use a binary search strategy.
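The linear, one-tuple-at-a-time pruning just described (not the binary-search variant of Alg. 8) can be sketched as follows. The simulated user encodes query (2) under the same assumptions as the rest of this walk-through, and `prune_linear` is a hypothetical helper name.

```python
UNIVERSAL = [({1, 4}, 5), ({3, 4}, 5), ({1, 2}, 6)]
EXISTENTIAL = [{1, 2, 3, 6}, {2, 3, 4, 5}, {1, 2, 5, 6}, {2, 3, 5, 6}]


def is_answer(question):
    """Simulated user for the normalized target query (2)."""
    tuples = [{i + 1 for i, b in enumerate(t) if b == "1"} for t in question]
    for body, head in UNIVERSAL:
        if any(body <= t and head not in t for t in tuples):
            return False
    required = EXISTENTIAL + [body | {head} for body, head in UNIVERSAL]
    return all(any(c <= t for t in tuples) for c in required)


def prune_linear(tuples, ask):
    """Try dropping each tuple in turn; keep it only if dropping it
    turns the question into a non-answer. Uses one question per tuple."""
    kept = list(tuples)
    for t in list(tuples):
        trial = [u for u in kept if u != t]
        if ask(trial):
            kept = trial
    return kept


level_one = ["110111", "011111", "111011", "101111"]
print(sorted(prune_linear(level_one, is_answer)))
# ['011111', '101111', '111011']
```

With this visiting order the sketch reproduces the pruned set from the text. A different order may keep 110111 instead of 101111, since either one covers the guarantee clause ∃x₁x₄x₅.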

Level 2: We replace one of the tuples, 111011, with its
children on level 2: {011011, 101011, 110011, 111001}. Note
that we removed 111010 because it violates ∀x₁x₂ → x₆. As
before we determine which tuples we can safely prune. We
are left with {110011, 111001}.



[Figure: a membership question with all the highlighted tuples is an
answer. Tuples that violate universal Horn expressions are removed
from the lattice.]



Similarly we replace 101111 with its children on level
2: {001111, 100111, 101011, 101110}. We did not consider
101101 because it violates ∀x₃x₄ → x₅. We can
safely prune the children down to one tuple: 101110.
We then replace 011111 with its children on level 2
and prune those down to {011011, 011110}. At the
end of processing level 2, we are left with the tuples:
{110011, 111001, 101110, 011011, 011110}. We repeat this
process again, now replacing each tuple with tuples from
level 3.

Level 3: When we replace 011110 with its children
{010110, 011010, 001110}, we can no longer satisfy
∃x₂x₃x₄x₅. The question is a non-answer and we learn that
011110 is a distinguishing tuple and that ∃x₂x₃x₄x₅ is a
conjunction in the target query. Note that we did not consider
the child tuple 011100 because it violates the universal
Horn expression ∀x₃x₄ → x₅. We fix 011110 in all subsequent
membership questions.

[Figure: replacing 011110 with its children results in a non-answer:
we learn that 011110 is a distinguishing tuple for ∃x₂x₃x₄x₅. The
distinguishing tuple is fixed in all following membership questions.
Tuples that violate universal Horn expressions are removed.]



When we replace 011011 with its children
{001011, 010011, 011001, 011010}, we can no longer satisfy
∃x₂x₃x₅x₆. The question is a non-answer and we learn
that 011011 is a distinguishing tuple and that ∃x₂x₃x₅x₆
is a conjunction in the target query. We fix 011011 in all
subsequent membership questions.

When we replace 111001 with its children
{011001, 101001, 110001}, the question is a non-answer,
and we learn that 111001 is a distinguishing tuple and that
∃x₁x₂x₃x₆ is a conjunction in the target query. Note that
we did not consider the tuple 111000 because it violates
∀x₁x₂ → x₆. We fix 111001 in all subsequent membership
questions.

We can replace 101110 with the children
{001110, 100110, 101010}. Note that the child 101100
is removed because it violates ∀x₁x₄ → x₅. We can safely
prune the children down to one tuple, 100110.

When we replace 110011 with its children
{010011, 100011, 110001}, we can no longer satisfy
∃x₁x₂x₅x₆. Thus, the question is a non-answer and
we learn that 110011 is a distinguishing tuple. Note that
we did not consider the tuple 110010 because it violates
∀x₁x₂ → x₆.

At this stage, we are left with the following tuples:

{110011, 100110, 111001, 011011, 011110}

At this point, we can continue searching for conjunctions
in the downset of 100110, which is the distinguishing tuple for
a known guarantee clause for the universal Horn expression
∀x₁x₄ → x₅. As an optimization to the algorithm, we do
not search the downset because all tuples in the downset are
dominated by 100110.

[Figure: the algorithm terminates with the distinguishing tuples
011011, 011110, 100110, 110011, 111001. After pruning the children of
101110, only 100110 is left. Since 100110 is a distinguishing tuple
for ∃x₁x₄x₅, the guarantee clause of ∀x₁x₄ → x₅, we do not continue
down the lattice.]

The learning algorithm terminates with the following
distinguishing tuples {110011, 100110, 111001, 011011, 011110},
which represent the expressions:

∃x₁x₂x₅x₆   ∃x₁x₄x₅   ∃x₁x₂x₃x₆   ∃x₂x₃x₅x₆   ∃x₂x₃x₄x₅

Footnote: We can relax the requirement of guarantee clauses for
universal Horn expressions and our learning algorithms will still
function correctly if they are allowed to ask about the
membership of an empty set.

Theorem 3.7. The lattice-based learning algorithm finds
the distinguishing tuples of all dominant existential
conjunctions of a normalized target query.

The algorithm will find a dominant distinguishing tuple
if there exists a path from at least one tuple in its current
pruned set of tuples to the distinguishing tuple. A path
between two tuples t_0, t_1 is simply the sequence of variables
to set to false to get from t_0 to t_1. When pruning, we ensure
that the pruned set of tuples dominates all the distinguishing
tuples: so there always exists a path from at least one tuple
in the pruned set of tuples to a distinguishing tuple provided
our lattice is complete.

To see that removing tuples that violate universal Horn
expressions will not impact the existence of a path, suppose
a tuple t_a in the pruned set dominates a distinguishing tuple
t_d and t_a does not violate any universal Horn expressions.
We relabel all the non-head variables in t_a to e_0^a...e_m^a and
all the head variables to h_0^a...h_{n−m}^a. Similarly, we relabel
the non-head variables in t_d with e_0^d...e_m^d and the head
variables with h_0^d...h_{n−m}^d. Consider the tuple t_b which has the
values of e_0^d...e_m^d for its non-head variables and the values
of h_0^a...h_{n−m}^a for its head variables. Clearly there exists a
path from tuple t_a to tuple t_b that does not encounter any
tuples that violate universal Horn expressions because the
head variables in both t_a and t_b have the same values.

Similarly, there exists a path from t_b to t_d. Since t_b is
in the upset of t_d, h_0^a...h_{n−m}^a dominates h_0^d...h_{n−m}^d. In a
normalized query, all existential conjunctions are expanded
to include head variables that are implied by the variables
of a conjunction (Rule 3). Thus, t_d does not violate any
universal Horn expression and the path from t_b to t_d only
sets head variables to false that do not violate any universal
Horn expressions.

Suppose the learning algorithm favors another tuple t_c
(instead of t_b) such that e_0^c...e_m^c is in the downset of
e_0^a...e_m^a and in the upset of e_0^d...e_m^d, and h_0^c...h_{n−m}^c is
in the downset of h_0^a...h_{n−m}^a and in the upset of
h_0^d...h_{n−m}^d. If the learning algorithm reaches t_c then t_c
does not violate any universal Horn expressions and by
induction (let t_a = t_c) there also exists a path from t_c to t_d. □

Theorem 3.8. The lattice-based learning algorithm asks
O(kn lg n) membership questions where k is the number of
existential conjunctions.

Proof: Consider the cost of learning one distinguishing
tuple t_l at level l. From the top of the Boolean lattice to t_l,
there is at least one tuple t_i at each level i (0 ≤ i < l) that
we did not prune and that we traversed down from to get to t_l.
Let N_i be the set of t_i's siblings. At each level i, we asked at
most lg |N_i| questions. |N_i| = n − (i − 1), or the out-degree
of N_i's parent. In the worst case, l = n, and the cost of
learning t_l is

\[ \sum_{i=1}^{n} \lg\big(n - (i - 1)\big) \le \sum_{i=1}^{n} \lg n = O(n \lg n). \]

With k distinguishing tuples we ask at most O(kn lg n)
questions. □

Theorem 3.9. Ω(nk) questions are required to learn
existential conjunctions.

From an information theoretic perspective, Ω(nk) questions
is a lower bound on the number of questions needed
to learn existential expressions. Consider level n/2 of the
Boolean lattice, which holds the maximum number of
non-dominated distinguishing tuples. There are (n choose n/2)
tuples at this level. Suppose we wish to learn k existential
expressions at this level. There are ((n choose n/2) choose k)
possible k-expressions. Since each question provides a bit of
information, a lower bound on the number of questions needed is

\[ \lg \binom{\binom{n}{n/2}}{k}. \]
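One way to see how this count gives Ω(nk) is sketched below. It uses two standard bounds, \(\binom{N}{k} \ge (N/k)^k\) and \(\binom{n}{n/2} \ge 2^n/(n+1)\) for even n. The extra assumption that \(\lg k \le cn\) for some constant c < 1 is introduced here for the sketch and is not stated in the source.

\[
\lg \binom{N}{k} \;\ge\; k \lg \frac{N}{k},
\qquad N = \binom{n}{n/2} \;\ge\; \frac{2^n}{n+1}
\]

\[
\lg \binom{\binom{n}{n/2}}{k} \;\ge\; k \big( n - \lg(n+1) - \lg k \big)
\;\ge\; k \big( (1-c)\,n - \lg(n+1) \big) \;=\; \Omega(nk).
\]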
4. QUERY VERIFICATION 

A query verifier constructs a set of membership questions
to determine whether a given query is correct. The verifier
will not find an alternate query if the query is incorrect.
Thus, while query learning is a search problem (a learner
searches for the one correct query that satisfies the user's
responses to membership questions), query verification is a
decision problem (a verifier decides if a given query is
correct or incorrect given the user's responses to membership
questions).

Our approach to query verification is straightforward: for
a given role-preserving qhorn* query $q_g$, we generate a
verification set of $O(k)$ membership questions, where $k$ is the
number of expressions in $q_g$. By contrast, our learning
algorithm for role-preserving qhorn queries asks
$O(n^{\theta+1} + kn \lg n)$ questions. If the user's intended query $q_i$ is
semantically different from the given query $q_g$, then for at least
one of the membership questions $M$ in the verification set,
$q_g(M) \neq q_i(M)$.
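The condition $q_g(M) \neq q_i(M)$ can be made concrete with a small evaluator. This is a minimal sketch under stated assumptions: a question is a list of Boolean tuples written as bit strings, variables are 1-based indices, a universal Horn expression holds when no tuple violates it and its guarantee clause is satisfied by some tuple, and an existential conjunction holds when some tuple sets all its variables to true. The names `holds` and `is_answer` are hypothetical.

```python
def holds(tup, variables):
    """True if every listed (1-based) variable is true in the tuple."""
    return all(tup[v - 1] == "1" for v in variables)


def is_answer(question, universals, existentials):
    """Classify a membership question (list of bit strings) under a query.

    universals:   list of (body, head) pairs, body a set of indices
    existentials: list of sets of indices
    """
    for body, head in universals:
        # No tuple may have the whole body true and the head false.
        if any(holds(t, body) and t[head - 1] == "0" for t in question):
            return False
        # Guarantee clause: some tuple satisfies body and head together.
        if not any(holds(t, set(body) | {head}) for t in question):
            return False
    # Every existential conjunction needs at least one witnessing tuple.
    return all(any(holds(t, e) for t in question) for e in existentials)


if __name__ == "__main__":
    q_g = dict(universals=[({1}, 2)], existentials=[])   # forall x1 -> x2
    q_i = dict(universals=[], existentials=[{1, 2}])     # exists x1 x2
    M = ["11", "10"]
    print(is_answer(M, **q_g), is_answer(M, **q_i))      # prints: False True
```

Here the question $M = \{11, 10\}$ is a non-answer to the given query but an answer to the intended one, so presenting $M$ to the user exposes the difference.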

Proposition 4.1. A user's intended query $q_i$ is semantically
different from a given query $q_g$ iff $q_i$ and $q_g$ have distinct
sets of existential (Def. 3.5) and universal (Def. 3.4)
distinguishing tuples.

Suppose we try to learn the two role-preserving qhorn
queries $q_i$ and $q_g$. If $q_i$ and $q_g$ are semantically different,
then our learning algorithm will terminate with distinct sets
of existential (Def. 3.5) and universal (Def. 3.4) distinguishing
tuples for each query. The verification set consists of
membership questions that detect semantic differences between
two queries by detecting differences in their respective sets
of distinguishing tuples. Fig. 6 lists six types of membership
questions from which the verification algorithm constructs a
verification set for a given query.

We explain how to construct each question for an example
query in §4.2.



4.1 Normalizing User-specified Queries 

All membership questions are constructed from dominant
existential (Rule 1) or universal (Rule 2) expressions. A
user-specified query $q_g$ may contain redundant or dominated
expressions. For example, $\exists x_1 x_2$ is dominated by $\exists x_1 x_2 x_3$
(Rule 1) and is therefore redundant.



4.1.1 Dominant Existential Distinguishing Tuples 

To find the dominant existential expressions, a simple routine
orders all existential expressions by the number of participating
variables from largest to smallest. For each expression
in the ordered list, we remove all other expressions
in the list whose participating variables are a subset of the
variables of the current expression. This leaves us with the
set of dominant existential expressions.

Answers

| Question | Membership questions | # of questions | Tuples per question |
|---|---|---|---|
| A1 | Distinguishing tuples for all dominant existential expressions (including guarantee clauses and existential Horn expressions)† | $O(1)$ | $O(k)$ |
| A2 | For each dominant universal Horn expression: (i) a tuple where all variables are true; (ii) children of the distinguishing tuple | $O(k)$ | $O(n)$ |
| A3 | For each dominant existential expression on $C$ variables such that there are one or more universal Horn expressions $\forall B_1 \to h \ldots \forall B_\theta \to h$ where $B_i \subset C$ for $i = 1 \ldots \theta$: (i) a tuple where all variables are true; (ii) search roots: a tuple where one body variable from each body $B_1 \ldots B_\theta$ is false, all other variables in $C$ are true and $h$ is false | $O(k)$ | $O(n^\theta)$ |
| A4 | (i) A tuple where all variables are true; (ii) a tuple for each non-head variable $x$ such that $x$ is false and all other variables are true | $O(1)$ | $O(n)$ |

Non-Answers

| Question | Membership questions | # of questions | Tuples per question |
|---|---|---|---|
| N1 | For each distinguishing tuple in A1 that is not due to a guarantee clause: (i) children of the distinguishing tuple†; (ii) all other tuples from A1 | $O(k)$ | $O(n + k)$ |
| N2 | For each dominant universal Horn expression: (i) a tuple where all variables are true; (ii) the distinguishing tuple | $O(k)$ | $O(1)$ |

† In constructing these questions, we do not violate universal Horn expressions: i.e. we set a head variable to true if the existential expression contains a body for the head variable.

Figure 6: Membership questions of a verification set.

* Since qhorn-1 is a sub-class of role-preserving qhorn, our
verification approach works for both query classes.

To construct a distinguishing tuple from an existential
expression, we set all participating variables of the expression
to true and the remaining variables to false. If setting one
of the remaining variables to false violates a universal Horn
expression, we set it to true. This is equivalent to rewriting
the given query $\exists x_1 x_2\ \forall x_1 \to h$ to the semantically equivalent
query $\exists x_1 x_2 h\ \forall x_1 \to h$ (Rule 3).
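The construction just described can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: tuples are bit strings, variables are 1-based indices, universal Horn expressions are `(body, head)` pairs, and the function name `existential_tuple` is hypothetical.

```python
def existential_tuple(variables, n, universals):
    """Distinguishing tuple for an existential expression on `variables`.

    Participating variables are set to true; a remaining head variable is
    also set to true when leaving it false would violate a universal Horn
    expression (the Rule 3 rewriting).
    """
    true_vars = set(variables)
    changed = True
    while changed:  # repeat until no universal Horn expression is violated
        changed = False
        for body, head in universals:
            if set(body) <= true_vars and head not in true_vars:
                true_vars.add(head)
                changed = True
    return "".join("1" if i in true_vars else "0" for i in range(1, n + 1))


if __name__ == "__main__":
    # exists x1 x2 with forall x1 -> h, where h is encoded as variable 3.
    print(existential_tuple({1, 2}, 3, [({1}, 3)]))  # prints: 111
```

In the usage line, the head variable is forced to true, matching the rewritten query $\exists x_1 x_2 h\ \forall x_1 \to h$.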



4.1.2 Dominant Universal Distinguishing Tuples 

A user-specified query $q_g$ may also contain redundant
universal Horn expressions. For example, $\forall x_1 x_2 \to x_3$ is
dominated by $\forall x_1 \to x_3$ (Rule 2). To find the dominant universal
Horn expressions, a simple routine orders all universal
Horn expressions by the number of participating
variables from smallest to largest. For each expression in
the ordered list, we remove all other expressions in the list
whose participating variables are a superset of the variables
of the current expression. This leaves us with the set of dominant
universal Horn expressions.
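Both pruning routines (largest-first for existential expressions, smallest-first for universal Horn expressions) follow the same pattern and can be sketched together. The code below is a hedged illustration: the function names are hypothetical, and comparing universal bodies only when they share a head variable is our reading of the example $\forall x_1 x_2 \to x_3$ versus $\forall x_1 \to x_3$.

```python
def dominant_existential(expressions):
    """Drop existential expressions whose variables are a strict subset
    of another expression's variables."""
    ordered = sorted({frozenset(e) for e in expressions}, key=len, reverse=True)
    kept = []
    for e in ordered:
        if not any(e < k for k in kept):   # e is not dominated by a kept one
            kept.append(e)
    return kept


def dominant_universal(expressions):
    """Drop universal Horn expressions (body, head) whose body is a strict
    superset of another body with the same head."""
    unique = {(frozenset(b), h) for b, h in expressions}
    ordered = sorted(unique, key=lambda bh: len(bh[0]))
    kept = []
    for body, head in ordered:
        if not any(h == head and b < body for b, h in kept):
            kept.append((body, head))
    return kept


if __name__ == "__main__":
    print(dominant_existential([{1, 2}, {1, 2, 3}]))     # keeps only {1, 2, 3}
    print(dominant_universal([({1, 2}, 3), ({1}, 3)]))   # keeps only ({1}, 3)
```

Processing in size order means a dominated expression is always compared against an already-kept dominating one, so a single pass suffices.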

To construct a universal distinguishing tuple from a uni- 
versal Horn expression, we set the head variable to false and 
all body variables of the expression to true. The remain- 
ing head variables are set to true and the remaining body 
variables are set to false. 
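As a concrete illustration of this rule, the sketch below builds the universal distinguishing tuple as a bit string. The function name and the convention of passing the set of all head variables explicitly are assumptions made for the example.

```python
def universal_tuple(body, head, n, heads):
    """Universal distinguishing tuple for the expression body -> head.

    body:  set of 1-based body variable indices of this expression
    head:  index of this expression's head variable
    heads: set of all head variables in the query
    """
    bits = []
    for i in range(1, n + 1):
        if i == head:
            bits.append("0")   # the head of this expression is false
        elif i in body or i in heads:
            bits.append("1")   # body variables and remaining heads are true
        else:
            bits.append("0")   # remaining body variables are false
    return "".join(bits)


if __name__ == "__main__":
    # forall x1 x4 -> x5 in a query whose head variables are x5 and x6.
    print(universal_tuple({1, 4}, 5, 6, {5, 6}))  # prints: 100101
```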



4.2 Example Verification Set 

We demonstrate the construction of a verification set on
a role-preserving qhorn query with six Boolean variables
$x_1, \ldots, x_6$. This is the same query that we previously learned
in §3.2.2:

$\forall x_1 x_4 \to x_5 \quad \forall x_1 x_2 \to x_6 \quad \forall x_3 x_4 \to x_5$
$\exists x_1 x_2 x_3 \quad \exists x_2 x_3 x_4 \quad \exists x_1 x_2 x_5 \quad \exists x_2 x_3 x_5 x_6$



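[A1] For each existential expression and guarantee clause,
we construct the following distinguishing tuples.

| Expression | Distinguishing tuple | Note |
|---|---|---|
| $\exists x_1 x_2 x_3$ | 111001 | Do not violate $\forall x_1 x_2 \to x_6$ |
| $\exists x_2 x_3 x_4$ | 011110 | Do not violate $\forall x_3 x_4 \to x_5$ |
| $\exists x_1 x_2 x_5$ | 110011 | Do not violate $\forall x_1 x_2 \to x_6$ |
| $\exists x_2 x_3 x_5 x_6$ | 011011 | |
| $\exists x_1 x_4 x_5$ | 100110 | The guarantee clause of $\forall x_1 x_4 \to x_5$ |
| $\exists x_1 x_2 x_6$ | 110001 | The guarantee clause of $\forall x_1 x_2 \to x_6$ |
| $\exists x_3 x_4 x_5$ | 001110 | The guarantee clause of $\forall x_3 x_4 \to x_5$ |

We eliminate the last two tuples {110001, 001110} as they
are non-dominant. Therefore, A1 is

111001
011110
110011
011011
100110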



[N1] For each dominant distinguishing tuple of an existential
expression, we construct the following four questions by
replacing each distinguishing tuple with its children:

| | $\exists x_1 x_2 x_3 (x_6)$ | $\exists x_2 x_3 x_4 (x_5)$ | $\exists x_1 x_2 x_5 (x_6)$ | $\exists x_2 x_3 x_5 x_6$ |
|---|---|---|---|---|
| Children of the distinguishing tuple | 110001, 101001, 011001 | 011010, 010110, 001110 | 110001, 100011, 010011 | 011010, 011001, 010011, 001011 |
| Other tuples from A1 | 011110, 110011, 011011, 100110 | 111001, 110011, 011011, 100110 | 111001, 011110, 011011, 100110 | 111001, 011110, 110011, 100110 |

To avoid violating Horn expressions, we set affected head
variables (in brackets) to true. The first row lists the children of
each dominant distinguishing tuple. Since the existential
expression is not satisfied by any of the tuples in a question,
the question is a non-answer.
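The N1 construction can be sketched as follows. This is an illustrative sketch, not the authors' code: the name `n1_questions` is hypothetical, and we assume that a child is obtained by setting exactly one participating variable of the expression to false while leaving every other bit (including head variables already forced to true) unchanged.

```python
def n1_questions(a1, participating):
    """Build one N1 question per non-guarantee distinguishing tuple.

    a1:            list of all A1 distinguishing tuples (bit strings)
    participating: dict mapping a distinguishing tuple to the 1-based
                   variables of its existential expression
    """
    questions = []
    for t, variables in participating.items():
        # Children: flip one participating variable to false at a time.
        children = [t[:v - 1] + "0" + t[v:] for v in sorted(variables)]
        others = [u for u in a1 if u != t]
        questions.append(children + others)
    return questions


if __name__ == "__main__":
    a1 = ["111001", "011110", "110011", "011011", "100110"]
    participating = {
        "111001": {1, 2, 3},
        "011110": {2, 3, 4},
        "110011": {1, 2, 5},
        "011011": {2, 3, 5, 6},
    }
    for question in n1_questions(a1, participating):
        print(question)
```

The guarantee-clause tuple 100110 has no entry in `participating`, so it is never replaced by its children and appears unchanged in every question.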

[A2] For each dominant universal Horn expression,
we construct the following distinguishing tuples:

$\forall x_1 x_4 \to x_5$: 100101
$\forall x_3 x_4 \to x_5$: 001101
$\forall x_1 x_2 \to x_6$: 110010

A2 questions consist of the all-true tuple and the children of the
universal distinguishing tuples:

| $\forall x_1 x_4 \to x_5$ | $\forall x_3 x_4 \to x_5$ | $\forall x_1 x_2 \to x_6$ |
|---|---|---|
| 111111 | 111111 | 111111 |
| 100001 | 001001 | 100010 |
| 000101 | 000101 | 010010 |



[N2] For each universal distinguishing tuple, we construct
the following questions:

| $\forall x_1 x_4 \to x_5$ | $\forall x_3 x_4 \to x_5$ | $\forall x_1 x_2 \to x_6$ |
|---|---|---|
| 111111 | 111111 | 111111 |
| 100101 | 001101 | 110010 |
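Given a universal distinguishing tuple, the A2 and N2 questions differ only in whether the tuple itself or its children are included. The sketch below illustrates this; the function names are hypothetical, and we assume (as in the example above) that children are formed by setting one body variable to false.

```python
def a2_question(t, body, n):
    """A2: the all-true tuple plus the children of distinguishing tuple t."""
    children = [t[:v - 1] + "0" + t[v:] for v in sorted(body)]
    return ["1" * n] + children


def n2_question(t, n):
    """N2: the all-true tuple plus the distinguishing tuple itself."""
    return ["1" * n, t]


if __name__ == "__main__":
    # forall x1 x4 -> x5 has the distinguishing tuple 100101.
    print(a2_question("100101", {1, 4}, 6))  # ['111111', '000101', '100001']
    print(n2_question("100101", 6))          # ['111111', '100101']
```

The A2 question is an answer because no child has a complete body set to true, while the N2 question is a non-answer because the distinguishing tuple has the full body true and the head false.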



[A3] First, we find existential expressions that dominate
the guarantee clauses of a universal Horn expression.
$\exists x_2 x_3 x_4 x_5$ dominates the guarantee clause $\exists x_3 x_4 x_5$ of
$\forall x_3 x_4 \to x_5$. Note that $\exists x_2 x_3 x_4 x_5$ is implied by the
expressions $\exists x_2 x_3 x_4\ \forall x_3 x_4 \to x_5$ in the query (Rule 3). Second,
we generate the question by generating search roots for the
body $x_3 x_4$ within the sub-lattice rooted at 011101.

111111
010101
111001

If in the intended query $x_5$ has another body in $x_2 x_3 x_4$
that is incomparable with $x_3 x_4$, the above question will be
a non-answer.

[A4] The query has four non-head variables $\{x_1, x_2, x_3, x_4\}$.
So we construct the following question.

111111
011111
101111
110111
111011
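The A4 question depends only on the number of variables and on which of them are head variables. A minimal sketch, with the hypothetical name `a4_question`:

```python
def a4_question(n, heads):
    """A4: the all-true tuple, plus one tuple per non-head variable x
    in which x is false and every other variable is true."""
    all_true = "1" * n
    question = [all_true]
    for x in range(1, n + 1):
        if x not in heads:
            question.append(all_true[:x - 1] + "0" + all_true[x:])
    return question


if __name__ == "__main__":
    # Head variables of the example query are x5 and x6.
    print(a4_question(6, {5, 6}))
    # ['111111', '011111', '101111', '110111', '111011']
```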

4.3 Completeness of a Verification Set 



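| Question | $\forall x_1 \forall x_2$ | $\exists x_1 \exists x_2$ | $\exists x_1 x_2$, $\exists x_1 \to x_2$ | $\forall x_1 \to x_2$ | $\forall x_2 \to x_1$ | $\forall x_1 \exists x_2$ | $\exists x_1 \forall x_2$ |
|---|---|---|---|---|---|---|---|
| A1 | | {10, 01} | | | | | |
| A2 | | | | {11, 00} | {11, 00} | | |
| A4 | {11} | {11, 01, 10} | {11, 01, 10} | {11, 01} | {11, 10} | {11, 10} | {11, 01} |
| N1 | | | {10, 01} | | | | |
| N2 | 1: {11, 01}; 2: {11, 10} | | | {11, 10} | {11, 01} | {11, 00} | {11, 00} |

Figure 7: Verification sets for each role-preserving qhorn
query on two variables.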



Fig. 7 illustrates the verification sets of all role-preserving
queries on two variables. Fig. 8 illustrates which membership
questions in the verification sets enable the user to detect
a discrepancy between the given query and the query
they actually intended. Note that with only two variables,
there is no need to generate A3 questions. These examples
serve to demonstrate the completeness of the verification
sets at least for role-preserving queries on two variables.

Figure 8: Membership questions that detect a difference
between two specific role-preserving qhorn queries on
two variables.

Theorem 4.2. A verification set with all membership
questions of Fig. 6 surfaces semantic differences between the
given query $q_g$ and the intended query $q_i$ by surfacing
differences between the sets of distinguishing tuples of $q_g$ and
$q_i$.

Proof: Case 1: $q_i$ and $q_g$ have different sets of dominant
existential distinguishing tuples. Then by Lemma 4.3, questions
A1 and N1 surface differences in the sets of dominant
existential distinguishing tuples of $q_g$ and $q_i$.

Case 2: $q_i$ and $q_g$ have different sets of dominant universal
distinguishing tuples. Then either:

1. Both $q_i$ and $q_g$ classify $h$ as a head variable. $q_i$ has a
dominant universal Horn expression $C_i : \forall B_i \to h$ ($B$ is
a set of body variables) and $q_g$ has dominant universal
Horn expressions of the form $\forall B_g \to h$.

(a) If for any $B_g$ in $q_g$, $B_i \subset B_g$ or $B_i \supset B_g$, then by
Lemmas 4.4 and 4.5 questions A2 and N2 will surface
this difference.

(b) If for all $B_g$ in $q_g$, $B_i$ and $B_g$ are incomparable, then
either (i) $C_i$'s guarantee clause dominates $q_g$'s
existential expressions and $q_g$'s set of existential
distinguishing tuples does not have the distinguishing
tuple for $C_i$'s guarantee clause (see Case 1), or (ii)
$C_i$'s guarantee clause is dominated by an existential
expression in $q_g$ and by Lemma 4.6 question A3
surfaces the difference.

2. $h$ is a head variable in $q_i$ but is a non-head variable in
$q_g$. Then by Lemma 4.7 question A4 surfaces the
difference. □

Lemma 4.3. Let $D_i$ be the set of $q_i$'s dominant existential
distinguishing tuples and let $D_g$ be the set of $q_g$'s dominant
existential distinguishing tuples; membership questions A1
and N1 surface $D_i \neq D_g$.

Proof: An existential distinguishing tuple represents an
inflection point: all questions constructed with tuples in the
distinguishing tuple's upset are answers and all questions
constructed with only tuples in the rest of the lattice are
non-answers. We use this feature to detect if $D_i \neq D_g$.

First, we define the following order relations over $D_i$ and
$D_g$:

1. $D_g < D_i$ if for every tuple $t_i \in D_i$, there exists a tuple
$t_g \in D_g$ such that $t_g$ is in the upset of $t_i$.

2. $D_g > D_i$ if all tuples in $D_g$ are in the downset of $D_i$.

3. $D_g \| D_i$ otherwise, i.e. they are incomparable.

Since $D_g \neq D_i$, only the following cases are possible:

Case 1: $D_g \| D_i$ or $D_g > D_i$. $D_g$, i.e. membership question
A1, is a non-answer to the user's intended query $q_i$. The user
will detect the discrepancy as $D_g$ is presented as an answer
in $q_g$'s verification set.

Case 2: $D_g < D_i$. Suppose all tuples in $D_g$ are in the
upset of one of $D_i$'s tuples. Let $D_g(t)$ be the set of
distinguishing tuples where we replace $t \in D_g$ with its children.
There are $|D_g| = O(k)$ such sets. These sets form
membership questions N1. For any $t \in D_g$, $D_g(t)$ is always a
non-answer to $q_g$. However, for at least one tuple $t$, $D_g(t)$
is an answer to $q_i$. This is because if $D_g < D_i$ then at least
one of $D_i$'s tuples is a descendant of one of $D_g$'s tuples, in
which case $D_g(t)$ is still in the upset of that tuple and thus
an answer. The user will detect the discrepancy as $D_g(t)$ is
presented as a non-answer in $q_g$'s verification set. □

Like existential distinguishing tuples, universal
distinguishing tuples represent an inflection point. All tuples
in the upset of the universal distinguishing tuple are
non-answers (as all of $h$'s body variables are true but $h$ is false).
All descendants of the universal distinguishing tuple are
answers (as no complete set of $h$'s body variables is true).

Let $t_i$ be $q_i$'s universal distinguishing tuple for an expression
on the head variable $h$. Let $t_g$ be one of $q_g$'s universal
distinguishing tuples for expressions on the head variable $h$.
We define the following order relations between $t_i$ and $t_g$:

1. $t_i < t_g$ if $t_i$ is in the upset of $t_g$.

2. $t_i > t_g$ if $t_i$ is in the downset of $t_g$.

3. $t_i \| t_g$ if $t_i$ and $t_g$ are incomparable.

Consider two distinct (dominant) tuples $t_{g1}$ and $t_{g2}$ of the
given query. By qhorn's equivalence rules (§2.1.1), $t_{g1}$
and $t_{g2}$ are incomparable ($t_{g1} \| t_{g2}$). Consequently, for any
two distinct tuples, $t_i < t_{g1}$ and $t_i > t_{g2}$ cannot both hold.



Lemma 4.4. Membership question A2 detects $t_i > t_g$.

Proof: Suppose $q_g$ has one universal distinguishing tuple
$t_g$ such that $t_i > t_g$. Then the membership question A2 that
consists of the all-true tuple and $t_g$'s children is an answer
for $q_g$ as none of $t_g$'s children have all the body variables set
to true, so the head variable can be false. If $t_i > t_g$ then
$q_i$'s universal Horn expression on $h$ has a strict subset of
the body variables represented by $t_g$. Therefore, in at least
one of $t_g$'s children, all of $t_i$'s body variables are set to true
and $h$ is still false. Thus, A2 is a non-answer to $q_i$. For all
other universal distinguishing tuples $t_g$ of $q_g$, either $t_i > t_g$
or $t_i \| t_g$. If $t_i \| t_g$ then A2 is still an answer. □

Lemma 4.5. Membership question N2 detects $t_i < t_g$.

Proof: Suppose $q_g$ has one universal distinguishing tuple
$t_g$ such that $t_i < t_g$. Then the membership question N2 that
consists of the all-true tuple and $t_g$ is a non-answer for $q_g$ as
$t_g$ has all body variables set to true but the head variable
$h$ is false. If $t_i < t_g$ then $q_i$'s universal Horn expression on
$h$ has a strict superset of the body variables represented by
$t_g$. Therefore, $t_g$ does not have all body variables set to true
and $h$ can be false. Thus, N2 is an answer to $q_i$.

For all other universal distinguishing tuples $t_g$ of $q_g$, either
$t_i < t_g$ or $t_i \| t_g$. If $t_i \| t_g$ then N2 is still a non-answer. □



Lemma 4.6. If

• $h$ is a head variable in $q_i$ and $q_g$,

• $q_i$ has a dominant universal Horn expression $\forall M \to h$
which $q_g$ does not have,

• $q_g$ has universal Horn expressions $\forall B_1 \to h \ldots \forall B_\theta \to h$,

• $B_i \| M$ for $i = 1 \ldots \theta$,

• $q_g$ has an existential expression on $C$ variables ($\exists C$)
such that $C \supseteq M$ and $C \supset B_i$ for $i = 1 \ldots \theta$,

then A3 surfaces a missing universal Horn expression
($\forall M \to h$) from $q_g$.

Proof: Consider $q_g$'s universal Horn expressions whose
guarantee clauses are dominated by $\exists C$:

$\forall B_1 \to h, \forall B_2 \to h, \ldots, \forall B_\theta \to h$

such that $B_i \subset C$ for $i = 1 \ldots \theta$. To build A3, we set one
body variable from each of $B_1, \ldots, B_\theta$ to false, the remaining
variables in $C$ to true and $h$ to false. There are
$|B_1| \times |B_2| \times \ldots \times |B_\theta| = O(n^\theta)$ such tuples. A3 now consists of all such
tuples and the all-true tuple.

A3 acts like the search phase of the learning algorithm
that looks for new universal Horn expressions (§3.2.1). A3
is a non-answer for $q_i$ as at least one of the tuples has all
variables in $M$ set to true (because $M \| B_i$ for $i = 1 \ldots \theta$) and
$h$ to false, thus violating $\forall M \to h$. □
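The search roots in this proof form a Cartesian product over the bodies, which the following sketch enumerates. It is an illustration only: the function name is hypothetical, and the treatment of variables outside $C$ (false, except for other head variables, which are set to true) is our assumption rather than something the proof specifies.

```python
from itertools import product


def a3_question(n, c_vars, bodies, head, other_heads=frozenset()):
    """A3: the all-true tuple plus one search root per choice of a
    single body variable from each body.

    c_vars:      variables of the existential expression (1-based)
    bodies:      list of body sets B_1..B_theta for the head variable
    head:        the head variable h, set to false in every root
    other_heads: other head variables, kept true (assumption)
    """
    roots = set()
    for choice in product(*[sorted(b) for b in bodies]):
        false_vars = set(choice) | {head}
        true_vars = (set(c_vars) | set(other_heads)) - false_vars
        roots.add("".join("1" if i in true_vars else "0"
                          for i in range(1, n + 1)))
    return ["1" * n] + sorted(roots)


if __name__ == "__main__":
    # C = {x1..x5}, bodies {x1,x2} and {x3,x4} for head h = x5.
    question = a3_question(5, {1, 2, 3, 4, 5}, [{1, 2}, {3, 4}], 5)
    print(question)  # ['11111', '01010', '01100', '10010', '10100']
```

With $M = \{x_1, x_3\}$, which is incomparable with both bodies, the root 10100 has every variable of $M$ true and $h$ false, so an intended query containing $\forall x_1 x_3 \to x_5$ would classify this question as a non-answer.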

Lemma 4.7. If $h$ is a head variable in $q_i$ but not in $q_g$
then question A4 surfaces the difference.

Proof: The all-true tuple satisfies all existential expressions
in $q_g$. For each body variable $x$ in $q_g$, A4 has a tuple
where $x$ is false and all other variables are true. If $x$ is a
head variable in $q_i$, then A4 should be a non-answer. □

This concludes the proof of Theorem 4.2.

5. RELATED WORK 

Learning & Verifying Boolean Formulas: Our work
is influenced by the field of computational learning theory.
Using membership questions to learn Boolean formulas was
introduced in 1988 [2]. Angluin et al. demonstrated the
polynomial learnability of conjunctions of (non-quantified)
Horn clauses using membership questions and a more powerful
class of questions known as equivalence questions [3].
The learning algorithm runs in time $O(k^2 n^2)$ where $n$ is the
number of variables and $k$ is the number of clauses.
Interestingly, Angluin proved that there is no PTIME algorithm for
learning conjunctions of Horn clauses that only uses
membership questions. Angluin et al.'s algorithm for learning
conjunctions of Horn formulas was extended to learn
first-order Horn expressions [11, 9]. First-order Horn expressions
contain quantifiers. We differ from this prior work in that
in qhorn we quantify over tuples of an object's nested relation;
we do not quantify over the values of variables. Our
syntactic restrictions on qhorn have counterparts in Boolean
formulas. Both qhorn-1 and read-once Boolean formulas p]
allow variables to occur at most once. Both role-preserving
qhorn queries and depth-1 acyclic Horn formulas [8] do not
allow variables to be both head and body variables.

Verification sets are analogous to the teaching sequences of
Goldman and Kearns [7]. A teaching sequence is the smallest
sequence of classified examples a teacher must reveal to
a learner to help it uniquely identify a target concept from
a concept class. Prior work provides algorithms to determine
the teaching sequences for several classes of Boolean
formulas [5] u\ [16] but not for our class of qhorn queries.

Learning in the Database Domain: Two recent
works on example-driven database query learning techniques,
Query by Output (QBO) [19] and Synthesizing View
Definitions (SVD) [6], focus on the problem of learning a query
$Q$ from a given input database $D$ and an output view $V$.
There are several key differences between this body of work
and ours. First, QBO and SVD perform as decision trees;
they infer a query's propositions so as to split $D$ into tuples
in $V$ and tuples not in $V$. We assume that users can provide
us with the propositions, so we focus on learning the structure
of the query instead. Second, we work on a different
subset of queries: QBO infers select-project-join queries and
SVD infers unions of conjunctive queries. Learning unions
of conjunctive queries is equivalent to learning $k$-term
Disjunctive Normal Form (DNF) Boolean formulas [10]. We
learn conjunctions of quantified Horn formulas. Since our
target queries operate over objects with nested sets of tuples
instead of flat tuples, we learn queries in an exponentially
larger query and data space. Finally, QBO and SVD work
with a complete mapping from input tuples to output tuples.
Our goal, however, is to learn queries from the smallest
possible mapping of input to output objects, as it is generally
impractical for users to label an entire database of objects
as answers or non-answers. We point out that we synthesize
our input when constructing membership questions, thus we
can learn queries independent of the peculiarities of a
particular input database $D$.

Using membership (and more powerful) questions to learn
concepts within the database domain is not novel. For example,
ten Cate, Dalmau and Kolaitis use membership and
equivalence questions to learn schema mappings [18]. A schema
mapping is a collection of first-order statements that specify
the relationship between the attributes of a source and a target
schema. Another example is Staworko's and Wieczorek's
work on using example XML documents given by the user
to infer XML queries. In both these works, the concept
class learned is quite different from the qhorn query class.

The Efficacy of Membership Questions: Learning
with membership questions is also known as Active Learning.
Active learning elicits several criticisms due to mixed
or negative results in some learning problems. We wish to
address two of the main criticisms:

1. Arbitrary Examples. Early work by Lang and Baum [13]
used membership questions to train a neural network
to recognize hand-written digits. They discovered that
users couldn't reliably respond to the questions, which were
images of artificially synthesized hybrids of two digits.*
This canonical negative result does not apply to our
work. We synthesize examples from the actual data domain.
Moreover, if we have a rich database, we can select
instances from the database that match our synthesized
Boolean tuples instead of synthesizing the data tuples.

2. Noisy Users. The criticism made here is that (i) users
may not have a clear idea of what constitutes a positive
(answer) or negative (non-answer) example or (ii)
users make mistakes. In query specification tasks, users
typically have a clear idea of what they are looking for.
This contrasts with data exploration tasks, where users
search the database without well-defined selection criteria.
A good user interface can ameliorate the second
issue. For example, if we provide users with a history
of all their responses to the different membership questions,
users can double-check their responses and change
an incorrect response. This triggers the query learning
algorithm to restart query learning from the point of error.

A survey by Settles discusses recent advances and challenges
in active learning [15].

* Later work by Kudo et al. demonstrates how appropriately
constructed membership questions can boost the performance
of character recognition algorithms [12].

6. CONCLUSION & FUTURE WORK 

In this paper, we have studied the learnability of a special
class of Boolean database queries: qhorn. We believe
that other quantified-query classes (other than conjunctions
of quantified Horn expressions) may exhibit different
learnability properties. Mapping out the properties of
different query classes will help us better understand the limits
of example-driven querying. In our learning/verification
model, we made the following assumptions: (i) the user's
intended query is either in qhorn-1 or role-preserving qhorn,
(ii) the data has at most one level of nesting. We plan to design
algorithms to verify that the user's query is indeed in
qhorn-1 or role-preserving qhorn. We have yet to analyze
the complexity of learning queries over data with multiple
levels of nesting. In such queries, a single expression can
have several quantifiers.

We plan to investigate Probably Approximately Correct
learning: we use randomly-generated membership questions
to learn a query with a certain probability of error [20].
We note that membership questions provide only one bit
of information: a response to a membership question is
either 'answer' (1) or 'non-answer' (0). We plan to examine
the plausibility of constructing other types of questions that
provide more information bits but still maintain interface
usability. One possibility is to ask questions to directly
determine how propositions interact, such as: "do you think
$p_1$ and $p_2$ both have to be satisfied by at least one tuple?"
or "when does $p_1$ have to be satisfied?"

Finally, we see an opportunity to create efficient query
revision algorithms. Given a query which is close to
the user's intended query, our goal is to determine the
intended query through few membership questions,
polynomial in the distance between the given query and
the intended query. Efficient revision algorithms exist for
(non-quantified) role-preserving Horn formulas [8]. The
Boolean lattice provides us with a natural way to measure
how close two queries are: the distance between the
distinguishing tuples of the given and intended queries.